Abstract

We motivate and analyse a new Tree Search algorithm, GPTS, based on recent theoretical 
advances in the use of Gaussian Processes for Bandit problems. We consider tree paths 
as arms and we assume the target/reward function is drawn from a GP distribution. The 
posterior mean and variance, after observing data, are used to define confidence intervals for 
the function values, and we sequentially play arms with highest upper confidence bounds. 

We give an efficient implementation of GPTS and we adapt previous regret bounds 
by determining the decay rate of the eigenvalues of the kernel matrix on the whole set of 
tree paths. We consider two kernels in the feature space of binary vectors indexed by the 
nodes of the tree: linear and Gaussian. The regret grows in square root of the number of 
iterations T, up to a logarithmic factor, with a constant that improves with bigger Gaussian 
kernel widths. We focus on practical values of T, smaller than the number of arms. 

Finally, we apply GPTS to Open Loop Planning in discounted Markov Decision Pro- 
cesses by modelling the reward as a discounted sum of independent Gaussian Processes. 
We report similar regret bounds to those of the OLOP algorithm. 

Keywords: Bandits, Gaussian Processes, Tree Search, Open Loop Planning, Markov 
Decision Processes 



1. Introduction 



In order to motivate the work presented here, we first review the problem of tree search 
and its bandit-based approaches. We motivate the use of models of arm dependencies in 
bandit problems, for the purpose of searching trees. We then introduce our approach based 
on Gaussian Processes, that we analyse in the rest of this paper. 



1.1 Context 



Tree search consists in looking for an optimal sequence of nodes to select, starting from 
the root, in order to maximise a reward given when a leaf is reached. We introduce this 
problem in more detail, we motivate the use of bandit algorithms for tree search and we 
review existing techniques. 
1.1.1 Tree search

Applications Tree search is important in Artificial Intelligence for Games, where the
machine represents possible sequences of moves as a tree and looks ahead for the first move
which is most likely to yield a win. Rewards are given by Monte Carlo simulations where we
randomly finish the game from the current position and return 1 for a win, 0 otherwise. Tree
search can also be used to search for an optimum in a space of sequences of given length,
as in sequence labelling. More generally, it can be used to search any topological space for
which a tree of coverings is defined, as shown by Bubeck et al. (2009), where each node
corresponds to a region of the space. For instance, if the space to search is a d-dimensional
hyper-rectangle, the root node of the tree of coverings is the whole hyper-rectangle, and
children nodes are defined recursively by splitting the region of the current node in two:
each region is a hyper-rectangle and we split in the middle along the longest side.
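
The recursive splitting rule described above can be sketched in a few lines of Python. This is only an illustration: the names `Box`, `split_box` and `covering_at_depth` are hypothetical, and breaking ties between equally long sides by the lowest dimension index is an assumption, since the text does not specify a tie-breaking rule.

```python
from typing import List, Tuple

# A hyper-rectangle is stored as one (low, high) interval per dimension.
Box = List[Tuple[float, float]]


def split_box(box: Box) -> Tuple[Box, Box]:
    """Split a hyper-rectangle in the middle along its longest side."""
    # Longest side; ties go to the lowest dimension index (assumption).
    axis = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
    low, high = box[axis]
    mid = (low + high) / 2.0
    left, right = list(box), list(box)
    left[axis] = (low, mid)
    right[axis] = (mid, high)
    return left, right


def covering_at_depth(root: Box, depth: int) -> List[Box]:
    """Regions attached to the nodes at a given depth (2**depth of them)."""
    level = [root]
    for _ in range(depth):
        level = [child for box in level for child in split_box(box)]
    return level
```

Each node of the resulting binary tree corresponds to one box, and the boxes at a given depth cover the root box.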

Planning in Markov Decision Processes In MDPs, an agent takes a sequence of
actions that take it into a sequence of states, gets rewards from the environment for each
action it takes, and aims at maximising its total reward. Alternatively, a simpler objective
is to maximise the discounted sum of rewards the agent gets: a discount factor $0 < \gamma < 1$
is given beforehand and a weight of $\gamma^t$ is applied to the reward obtained at time $t$, for
all $t$. If a generative model of the MDP is available (i.e. given a state we can determine
the actions available from this state and the rewards obtained for each of these actions,
without calling the environment), then we can represent the possible sequences of actions
as a tree and determine the reward for each path through this tree (as a discounted sum
of intermediate rewards). The idea of using bandit algorithms in the search for an optimal
action in large state-space MDPs (i.e. planning) was introduced by Kocsis and Szepesvári
(2006) and also considered by Chang et al. (2007)¹, as an alternative to costly dynamic
programming approaches that aim to approximate the optimal value function.
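
As an illustration of the reward attached to a tree path, the sketch below computes a discounted sum of intermediate rewards. The function names are hypothetical, and two conventions are assumptions rather than statements from the text: time is counted from $t = 0$, and intermediate rewards are taken in [0, 1] for the tail bound.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t along a path, with t counted from 0 (assumption)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def tail_bound(gamma: float, depth: int) -> float:
    """Largest possible contribution of rewards from time `depth` onwards,
    assuming every intermediate reward lies in [0, 1]."""
    return gamma ** depth / (1.0 - gamma)
```

Under these assumptions, two paths that share their first `depth` actions have discounted sums that differ by at most `tail_bound(gamma, depth)`, which shrinks as `gamma` decreases.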

Challenges Searching trees with large branching factors can be computationally
challenging, as applications to the game of Go have shown. It requires efficiently selecting
branches to explore based on their estimated potential (i.e. how good the reward can be at
leaves of paths going through this node) and the uncertainty in the estimations. Similarly,
high depths can be unattainable due to lack of computational time and bad selection of
the branches to explore. A tree search algorithm should not waste too much time
exploring sub-optimal branches, while still exploring enough in order not to miss the optimal one.
Bandit algorithms can be used to guide the selection of nodes in the exploration of the
tree, based on knowledge acquired from previous reward samples. However, one must be
cautious that the process of selecting the best nodes to explore first does not itself become
too computationally expensive. In the work of Gelly and Wang (2006) on the search of Go
game trees, bandit algorithms allow a more efficient exploration of the tree compared to
traditional Branch & Bound approaches (Alpha-Beta).



1. Bandits are also used by Ortner (2010) for closed-loop planning (where the chosen actions depend on
the current states) in MDPs with deterministic transitions.
1.1.2 Bandit problems

The bandit problem is a simple model of the trade-off between exploration and exploitation.
The multi-armed bandit is an analogy with a traditional slot machine, known as a one-armed
bandit, but with multiple arms. In the stochastic bandit scenario, the player, after pulling
(or 'playing') an arm selected from the finite set of arms, receives a reward. It is assumed
that the reward obtained when playing arm $i$ is a sample from a distribution $R_i$, unknown
to the player, and that samples are iid. A stochastic bandit problem is characterised by a
set of probability distributions $(R_i)_{1 \le i \le N}$².

Measure of performance The objective of the player is to maximise the collected
reward sum (or 'cumulative reward') through iterative plays of the bandit. The optimal arm
selection policy $S^*$, i.e. the policy that yields maximum expected cumulative reward,
consists in selecting arm $i^* = \operatorname{argmax}_i \{\mathbb{E} R_i\}$ to play at each iteration. The expected cumulative
reward of $S^*$ at time $t$ (after $t$ iterations) is $t\,\mathbb{E} R_{i^*}$. The performance of a policy $S$ is assessed
by the analysis of its cumulative regret at time $T$, defined as the difference between the
expected cumulative reward of $S^*$ and $S$ at time $T$.

Exploration vs. exploitation A good policy must optimally balance the learning
of the distributions $R_i$ and the exploitation of arms which have been learnt as having
high expected rewards. When the number of arms is finite and smaller than the number
of experiments allowed, it is possible to explore all the possible options (arms) a certain
number of times, thus building empirical estimates of the $\mathbb{E} R_i$, and exploit the best performing
ones. As the number of times we play the same arm $i$ grows, we expect our reward estimate
to improve.

Optimism in the face of uncertainty A popular strategy for balancing exploration and
exploitation consists in applying the so-called "optimism in the face of uncertainty"
principle: reward estimates and uncertainty estimates are maintained for each arm, such that the
probability that the actual mean-reward values are outside of the confidence intervals drops
quickly. The arm to be played at each time step is the one for which the upper bound of the
confidence interval is the highest. This strategy, as implemented in the UCB algorithm, has
been shown by Auer et al. (2002) to achieve optimal regret growth-rate for problems with
independent arms: problem-specific upper bound in $O(\log(T))$, and problem-independent
upper bound in $O(\sqrt{T})$³.
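
A minimal sketch of the optimism principle for independent arms follows. It assumes the UCB1 index of Auer et al. (2002), namely the empirical mean plus $\sqrt{2 \log t / n_i}$ where $n_i$ is the number of plays of arm $i$; the function name `ucb1` and the `pull` callback are hypothetical.

```python
import math
from typing import Callable, List


def ucb1(pull: Callable[[int], float], n_arms: int, horizon: int) -> List[int]:
    """Run UCB1 for `horizon` rounds and return the number of plays per arm."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: play every arm once
        else:
            # Optimism: pick the arm with the highest upper confidence bound.
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts
```

The confidence width of an arm shrinks as it is played more often, so sub-optimal arms are revisited less and less frequently.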

1.1.3 Bandit-based Tree Search algorithms 

Typically, algorithms proceed in iterations. At the $t$-th iteration, a leaf node $n_t$ is selected
and a reward $y_t$ is received. It is usually assumed that there exists a mean-reward function
$f$ such that $y_t$ is a noisy observation of $f(n_t)$. Other common assumptions are that $f^*$, the
highest value of $f$, is known (or an upper bound on $f^*$ is known) and is always bigger than


2. Non-stochastic bandit problems are also of interest, as well as problems in which the distributions are
allowed to change through time (see Bubeck, 2010, for an overview of the different types of bandit
problems).

3. A regret bound is said to be problem-specific when it involves quantities that are specific to the current
bandit problem, such as the sub-optimalities $\Delta_i = \mathbb{E} R_{i^*} - \mathbb{E} R_i$ of arms, based on the means of the
distributions for this problem. The second bound, however, does not involve such quantities.
$y_t$. The algorithm stops when a convergence criterion is met, when a computational/time
budget is exhausted (in game tree search for instance), or when a maximum number of
iterations has been specified (this is referred to as "fixed horizon" exploration, as opposed
to "anytime"). In the end, a path through the tree is given. This can simply be the path
that leads to the leaf node that received the highest reward.

Path selection as a sequence of bandit problems The algorithm developed by
Kocsis and Szepesvári (2006), UCT, considers bandit problems at each node in the tree.
The children of a given node represent the arms of the associated bandit problem, and the
rewards obtained for selecting an arm are the values obtained at a leaf. At each iteration,
we start from the root and select children nodes by invoking the bandit algorithms of the
parents, until a leaf is reached and a reward is received, which is then back-propagated to
the ancestors up to the root. The bandit algorithm used in UCT is UCB (Auer et al., 2002),
which stands for Upper Confidence Bounds and implements the principle of optimism in
the face of uncertainty⁴.
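
One iteration of this descent and back-propagation scheme might look as follows. This is a simplified sketch and not the exact UCT algorithm: it assumes a complete tree with fixed branching factor and depth, a UCB1-style index at each node, and unvisited children tried first. The names `ucb_index`, `uct_iteration` and `reward_at_leaf` are hypothetical.

```python
import math
from typing import Callable, Dict, Tuple

Path = Tuple[int, ...]                 # child indices followed from the root
Stats = Dict[Path, Tuple[int, float]]  # node -> (visit count, mean reward)


def ucb_index(stats: Stats, parent: Path, child: Path) -> float:
    """UCB1-style index of a child, seen as an arm of its parent's bandit."""
    n, mean = stats.get(child, (0, 0.0))
    if n == 0:
        return math.inf  # unvisited children are tried first (assumption)
    parent_n = stats[parent][0]
    return mean + math.sqrt(2.0 * math.log(parent_n) / n)


def uct_iteration(stats: Stats, reward_at_leaf: Callable[[Path], float],
                  branching: int, depth: int) -> Path:
    """Descend from the root to a leaf, then back-propagate the leaf reward."""
    node: Path = ()
    visited = [node]
    while len(node) < depth:
        parent = node
        children = [parent + (c,) for c in range(branching)]
        node = max(children, key=lambda ch: ucb_index(stats, parent, ch))
        visited.append(node)
    reward = reward_at_leaf(node)
    for v in visited:  # update the leaf and all its ancestors, root included
        n, mean = stats.get(v, (0, 0.0))
        stats[v] = (n + 1, mean + (reward - mean) / (n + 1))
    return node
```

Calling `uct_iteration` repeatedly on the same `stats` dictionary grows the statistics of the visited nodes; the mean stored at a node averages the rewards of all leaves reached through it.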



"Smooth" trees Although Gelly and Wang (2006) reported that UCT performed very
well on Go game trees, it was shown by Coquelin and Munos (2007) that it can behave
poorly in certain situations because of "overly optimistic assumptions in the design of its
upper confidence bounds" (Bubeck and Munos, 2010), leading to a high lower bound on
its cumulative regret. Another algorithm was proposed, BAST (Bandit Algorithm for
Smooth Trees), which can be parameterised to adapt to different levels of smoothness of
the reward function on leaves, and to deal with the situations that UCT handles badly.
BAST is only different from UCT in the definition of its 'upper confidence bounds' (UCT is
actually a special case of BAST, corresponding to a particular value of one of the algorithm's
parameters). A time-independent regret upper bound was derived; however, it was expressed
in terms of the sub-optimality values $\Delta_i$ of nodes (dependent on the reward $f$ on nodes,
hence unknown to the algorithm) and was thus problem-specific. Also, quite paradoxically,
the bound could become very high for smooth functions (because of $1/\Delta_i$ terms).

Optimistic planning in discounted MDPs The discount factor implies a particular
smoothness of the function $f$ on tree paths (the smaller $\gamma$, the smoother the function), which
is the starting point of the work of Bubeck and Munos (2010) on the Open Loop Optimistic
Planning algorithm (OLOP), close in spirit to BAST. OLOP has been proved to be minimax optimal,
up to a logarithmic factor, which means that the upper bound growth rate of its simple
regret⁵ matches the lower bound. However, OLOP requires the knowledge of the time
horizon $T$ and the regret bounds do not apply when the algorithm is run in an anytime
fashion.

Measure of performance A Tree Search algorithm's performance can be measured, as
for a bandit algorithm, by its cumulative regret $R_T = T f^* - \sum_{t=1}^{T} f(n_t)$. However, although
this is a good objective to achieve a good exploration/exploitation balance, we might be
ultimately interested in a bound on how far the reward value for the best node we would


4. In the tree setting however, rewards are not iid and the values used by the UCB algorithms at each node 
are not true upper confidence bounds. 

5. The simple regret is defined as the difference between $f^*$ and the best value of $f$ for the arms that have
been played.
see after $T$ iterations is from the optimal $f^*$. Or it might be more useful to bound the
regret after a given execution time (instead of a number of iterations) in order to compare
algorithms that have different computational complexity.
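
The two performance measures can be compared on a toy sequence of plays. The sketch below uses hypothetical function names and takes as input the mean-reward values $f(n_t)$ of the nodes that were played, which are of course unknown to the algorithm itself.

```python
from typing import Sequence


def cumulative_regret(f_star: float, played_values: Sequence[float]) -> float:
    """T * f_star minus the sum of f over the T nodes that were played."""
    return len(played_values) * f_star - sum(played_values)


def simple_regret(f_star: float, played_values: Sequence[float]) -> float:
    """Gap between f_star and the best f value among the played nodes."""
    return f_star - max(played_values)
```

Since the best played value is at least the average played value, the simple regret is never larger than the cumulative regret divided by $T$, so a bound on the latter yields a bound on the former.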

1.2 Many-armed bandit algorithms 

It is of interest to consider bandit problems in which there are more arms than the possible 
number of plays, or in which there is an infinity of arms. We refer to this as the "many- 
armed" bandit problem. In this case, we need a model of dependencies between arms in 
order to get, from one play, information about several arms - and not only the one that 
was played. We show how such models can be applied to online global optimisation. In 
particular, we review the use of Gaussian Processes for modelling arm dependencies. 



1.2.1 Bandits for online global optimisation 

Bandit algorithms have been used to focus exploration in global optimisation. Each point
in the space of search is an arm, and rewards are given as we select points where we want
to observe the function. Even though the actual objective may not be to minimise the
cumulative regret but to minimise the simple regret, we have seen above how a bound
on the former can give a bound on the latter. The cumulative regret is also interesting
as it forces algorithms not to waste samples. Samples can be costly to acquire in certain
applications, as they might involve a physical and expensive action for instance, such as
deploying a sensor or taking a measurement at a particular location (see the experiments on
sensor networks performed by Srinivas et al., 2010), or they can simply be computationally
costly: the fewer samples, the quicker we can find a maximum.

Modelling dependencies The observations may or may not be noisy. In the latter case,
the bandit problem is trivial when the search space has fewer elements than the maximum
number of iterations we can perform. But in global optimisation, the search space is usually
continuous. In that case, as pointed out by Wang et al. (2008), if no assumption is made
on the smoothness of $f$, the search might be arbitrarily hard. The key idea is to model
dependencies between arms, through smoothness assumptions on $f$, so that information
can be gained about several arms (if not the whole set of arms) when playing only one
arm. Modelling dependencies is also beneficial in problems with finite numbers of arms,
as it speeds up the learning. Pandey et al. (2007) have developed an algorithm which
exploits cluster structures among arms, applied to a content-matching problem (matching
webpages to ads). Auer and Shawe-Taylor (2010) use a kernelised version of LinRel, a UCB-type
algorithm introduced by Auer (2003) for linear optimisation and further analysed by
Dani et al. (2008), for an image retrieval task with eye-movement feedback. LinRel has a
regret in $\tilde{O}(\sqrt{T})$, i.e. that grows in $\sqrt{T}$ up to a logarithmic term⁶.

Continuous arm spaces Bandit problems in continuous arm spaces have been studied
notably by Kleinberg et al. (2008), Wang et al. (2008) and Bubeck et al. (2009). To each
bandit problem corresponds a mean-reward function $f$ in the space of arms. Kleinberg et al.
(2008) consider metric spaces, Lipschitz functions, and derive a regret growth-rate in



6. The $\tilde{O}$ notation is the one used by Bubeck et al. (2009) and equivalent to $O^*$ used by Srinivas et al.
(2010): $u_n = \tilde{O}(v_n)$ iff there exist $\alpha, \beta > 0$ such that $u_n \le \alpha \log(v_n)^{\beta} v_n$.
$\tilde{O}(T^{\frac{d+1}{d+2}})$, which strongly depends on the dimension $d$ of the input space. Bubeck et al.
(2009), however, consider arbitrary topological spaces, weak-Lipschitz functions (i.e. local
smoothness assumptions only) and derive a regret in $\tilde{O}(\sqrt{\exp(O(d))\,T})$. The rate of growth
is this time independent of the dimension of the input space. Quite interestingly, the
algorithm of Bubeck et al. (2009), HOO, uses BAST on a recursive splitting of the space where
each node corresponds to a region of the space and regions are divided in halves, i.e. all
non-leaf nodes have two children. BAST is used to select smaller and smaller regions to
randomly sample $f$ in. The algorithm developed by Wang et al. (2008), UCB-AIR, assumes
that the probability that an arm chosen uniformly at random is $\epsilon$-optimal scales in $\epsilon^{\beta}$. Thus,
when there are many near-optimal arms and when choosing a certain number of arms
uniformly at random, there exists at least one which is very good with high probability. Their
regret bound is in $\tilde{O}(\sqrt{T})$ when $\beta < 1$ and $f^* < 1$, and in $\tilde{O}(T^{\frac{\beta}{1+\beta}})$ otherwise.



1.2.2 Gaussian Process optimisation 

GP assumption In the global optimisation setting, a very popular assumption in the
Bayesian community is that $f$ is drawn from a Gaussian Process, due to the flexibility and
power of GPs (see Brochu et al., 2009, for a review of Bayesian optimisation using GPs)
and their applicability in practice in engineering problems. GP optimisation is sometimes
referred to as "Kriging" and response surfaces (see Grünewälder et al., 2010, and references
therein). GPs are probability distributions over functions, that characterise a belief on
the smoothness of functions. The idea, roughly, is that similar inputs are likely to yield
similar outputs. The similarity is defined by a kernel/covariance function⁷ between inputs.
Parameterising the covariance function translates into a parametrisation of the smoothness
assumption. Note that this is a global smoothness assumption which is thus stronger than
that of Bubeck et al. (2009). Like the UCB-AIR assumption, it is a probabilistic assumption
too, although a stronger one. Srinivas et al. (2010) claim that the GP assumption is neither
too weak nor too strong in practice. One added benefit of this Bayesian framework is the
possibility of tuning the parameters of our smoothness assumption (encoded in the kernel)
by maximising the likelihood of the observed data, which can be written in closed form for
the commonly used Automatic Relevance Determination kernel (see Rasmussen and Williams,
2006, chap. 5). In comparison, parameter tuning is critical for HOO to perform well and
parameters need to be tuned by hand.

Acquisition of samples Similarly to bandit problems, function samples are acquired
iteratively and it is important to find ways to efficiently focus the exploration of the input space.
The acquisition of function samples was often based on heuristics, such as the Expected
Improvement and the Most Probable Improvement (Mockus, 1989) that proved successful in
practice (Lizotte et al., 2007). A more principled approach is that of Osborne et al. (2009)
which considers a fixed number $T$ of iterations ("finite horizon" in the bandit terminology)
and fully exploits the Bayesian framework to compute at each time step $t$ the expected loss⁸
over all possible $T - t$ remaining allocations as a function of the arm $x$ allocated at time
$t$. For this, the probability of loss is broken down into the probability of loss given the



7. We use the terms 'kernel function' and 'covariance function' equivalently in the rest of the paper. 

8. In their approach, the loss is defined by the simple regret but one could imagine using the cumulative 
regret instead. 
arms at times $t$ to $T$, times the probability of picking these arms, which can also be broken
down recursively. This is similar in spirit to the pioneering work of Gittins and Jones (1979)
on bandit problems and on "dynamic allocation indices" for arms (also known as Gittins
indices). Here, computing the optimal allocation of $T$ samples has an extremely high
computational cost⁹, which is warranted in problems where function samples are very expensive
themselves. The simple regret of this procedure was analysed by Grünewälder et al. (2010)
in the case where observations are not noisy.

UCB heuristic for acquiring samples GP approaches have been extended to the
bandit setting, with the Gaussian Process Upper Confidence Bound algorithm (GP-UCB or
GPB) presented by Dorard et al. (2009), for which a theoretical regret bound was given by
Srinivas et al. (2010), based on the rate of decay of the eigenvalues of the kernel matrix on
the whole set of arms, if finite, or of the kernel operator: $\tilde{O}(\sqrt{T})$ for the linear and Gaussian
kernels. This seems to match, up to a logarithmic factor, $T$ times the lower bound on the
simple regret given by Grünewälder et al. (2010), which is a lower bound on the cumulative
regret. As the name GP-UCB indicates, the sample acquisition heuristic is based on the
optimism in the face of uncertainty principle, where the GP posterior mean and variance
are used to define confidence intervals. Better results than with other Bayesian acquisition
criteria were obtained on the sensor network applications presented by Srinivas et al.
(2010). There still remains the problem of finding the maximum of the upper confidence
function in order to implement this algorithm, but Brochu et al. (2009) showed that global
search heuristics are very effective.



1.3 A Gaussian Process approach to Tree Search 

In light of this, we consider a GP-based algorithm for searching the potentially very large
space of tree paths, with a UCB-type heuristic for choosing arms to play at each iteration.
We consider only one bandit problem for the whole tree, where arms are tree paths¹¹.
The kernel used with the GP algorithm is therefore a kernel between paths, and it can be
defined by looking at nodes in common between two paths. The GP assumption makes
sense for tree search as similar paths will have nodes in common, and we expect that the
more nodes in common, the more likely they are to have similar rewards (this is clearly true for
discounted MDPs). Owing to GPs, we can share information gained from 'playing' a path
with other paths that have nodes in common (which Bubeck and Munos, 2010, also aim at
doing as stated in the last part of their Introduction section). Also, we will be able to use
the results of Srinivas et al. (2010) to derive problem-independent regret bounds for our



9. In their experiments, the number of iterations was only twice the dimension of the problem. 

10. In practice, rewards are taken in [0, 1] in bandit problems, but it is more convenient when dealing with
Gaussian Processes to have output spaces centred around 0 (easier expressions for the posterior mean
when the prior mean is the function 0). With GPs, we do not assume that the $f$ values are within a
known interval. We previously mentioned that an upper bound on $f^*$ could be known, but there is no
easy way to encode this knowledge in the prior, which is probably what motivated Graepel et al. (2010)
to consider a generalised linear model with a probit link function, in order to learn the Click Through
Rates of ads (in [0, 1]) displayed by web search engines, while maximising the number of clicks (also an
exploration vs. exploitation problem).

11. This is similar to Bubeck and Munos (2010, sec. 4) where bandit algorithms for continuous arm spaces
are compared to OLOP.
algorithm¹² once we have studied the decay rate of the eigenvalues of the kernel matrix on
the set of all arms (tree paths here), which determines the rate of growth of the cumulative
regret in their work.
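
A kernel between paths based on shared nodes can be sketched as follows. Several choices here are assumptions made for illustration: nodes are identified by path prefixes, the root is left out of the feature vector (it is common to all paths), and the Gaussian kernel is parameterised as $\exp(-\lVert \phi(p) - \phi(q) \rVert^2 / (2 s^2))$ with width $s$. The function names are hypothetical.

```python
import math
from typing import Set, Tuple

Path = Tuple[int, ...]  # child indices followed from the root


def nodes_on_path(path: Path) -> Set[Path]:
    """Nodes visited by a path, each identified by its prefix (root excluded)."""
    return {path[:i] for i in range(1, len(path) + 1)}


def linear_kernel(p: Path, q: Path) -> float:
    """Inner product of binary node-indicator vectors: number of shared nodes."""
    return float(len(nodes_on_path(p) & nodes_on_path(q)))


def gaussian_kernel(p: Path, q: Path, width: float) -> float:
    """Gaussian kernel in the same feature space, with the given width."""
    # For binary vectors, the squared Euclidean distance is the number of
    # nodes that lie on exactly one of the two paths.
    sq_dist = len(nodes_on_path(p) ^ nodes_on_path(q))
    return math.exp(-sq_dist / (2.0 * width ** 2))
```

With these definitions, two paths that diverge later share more nodes and therefore have a larger kernel value, and a larger width makes all paths look more similar.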

Assumptions Similarly to BAST, we wish to model different levels of smoothness of the
response/reward function on the leaves/paths. For this, we can extend the notion of
characteristic length-scale to such functions by considering a Gaussian covariance function in a
feature space for paths. Smoothness of the covariance/kernel translates to quick eigenvalue
decay rate which can be used to improve the regret bound. As we already said, the
parameter(s) of our smoothness assumption can be learnt from training data. Note that the GP
smoothness assumption is global, whereas BAST only assumes smoothness for $\eta$-optimal
nodes. But in examples such as Go tree search we can expect $f$ to be globally smooth, and
for planning in discounted MDPs this is even clearer as $f$ is defined as a sum of
intermediate rewards and is thus Lipschitz with respect to a certain metric (see Bubeck and Munos,
2010, sec. 4). As such, $f$ is also made smoother by decreasing the value of the discount
factor $\gamma$. Finally, GPs allow us to model uncertainty, which results in tight confidence bounds,
and can also be taken into account when outputting a sequence of actions at the end of the
tree search: instead of taking the best observed action, we might take the one with highest
lower confidence bound for a given threshold.

Main results We derive regret bounds for our proposed GP-based Tree Search algorithm,
run in an anytime fashion (i.e. without knowing the total number of iterations in advance),
with tight constants in terms of the parameters of the Tree Search problem. The regret can
be bounded with high probability in:

• $O(T\sqrt{\log(T)})$ for small values of $T$

• $O(\sqrt{T\log(T)})$ for $T > B^D$ where $B$ is the maximum branching factor and $D$ the
maximum depth of the tree¹³

• $O(\log(T)\sqrt{T})$ otherwise.

Although the rates are worse for smaller values of T, the bounds are tighter because the 
constants are smaller. For $T < B^D$, we have a constant in $\sqrt{\ldots(B-1)(D+1)\ldots}$ for the linear
kernel, and in $\ldots$ for the Gaussian kernel with width $s$: the regret improves when the
width increases. Having small constants in terms of the size of the problem is important,
since $N = B^D$ is very large in practice and computational budgets do not allow $T$ to go
beyond this value.



1.4 Outline of this paper 

First, we describe the GP-UCB (or GPB) algorithm in greater detail and its application
to tree search in Section 2. In particular, we show how the search for the max of the
12. The bounds will be expressed in terms of the maximum branching factor and depth of the tree, and of
the parameters of the kernel in our model, but they won't depend on actual $f$ values.

13. $D$ is considered to be fixed but we will see in Section 5 that we can extend our analysis to cases where
$D$ depends on $T$.






Gaussian Process Bandits for Tree Search 



upper confidence function can be made efficient in the tree case. The theoretical analysis of
the algorithm begins in Section 3 with the analysis of the eigenvalues of the kernel matrix
on the whole set of tree paths. It is followed in Section 4 by the derivation of an upper
bound on the cumulative regret of GPB for tree search, that exploits the eigenvalues' decay
rate. Finally, in Section 5 we compare GPB to other algorithms, namely BAST for tree
search and OLOP for MDP planning, from a theoretical perspective. We also show how a
cumulative regret bound can be used to derive other regret bounds. We propose ideas for
other Tree Search algorithms based on Gaussian Process Bandits, before bringing forward
our conclusions.

2. The algorithm 

In this section, we show how Gaussian Processes can be applied to the many-armed bandit 
problem, we review the theoretical analysis of the GPB algorithm and we describe its 
application to tree search. 

2.1 Description of the Gaussian Process Bandits algorithm 

We formalise the Gaussian Process assumption on the reward function, before giving the 
criterion for arm selection in the GPB framework. 

2.1.1 The Gaussian Process assumption 

Definition A GP is a probability distribution over functions, and is used here to formalise
our assumption on how smooth we believe $f$ to be. It is an extension of multi-variate
Gaussians to an infinite number of variables (an $N$-variate Gaussian is actually a distribution
over functions defined on spaces of exactly $N$ elements). A GP is characterised by a mean
function and a covariance function. The mean is a function on $\mathcal{X}$ and the covariance is a
function of two variables in this space - think of the extension of a vector and a matrix to
an infinite number of components. When choosing inputs $x_a$ and $x_b$, the probability density
for outputs $y_a$ and $y_b$ is a 2-variate Gaussian with covariance matrix

$$\begin{pmatrix} \kappa(x_a, x_a) & \kappa(x_a, x_b) \\ \kappa(x_b, x_a) & \kappa(x_b, x_b) \end{pmatrix}$$

where $\kappa$ denotes the covariance function. This holds when extending to any $n$ inputs. We see here that the role of the similarity
measure between arms is taken by the covariance function, and, by specifying how much
outputs co-vary, we characterise how likely we think that a set of outputs for a finite set
of inputs is, based on the similarities between these inputs, thus expressing a belief on the
smoothness of $f$.

Inference and noise modelling The reward may be observed with noise which, in the
GP framework, is modelled as additive Gaussian white noise. The variance of this noise
characterises the variability of the reward when always playing the same arm. In the absence
of any extra knowledge on the problem at hand, $f$ is flat a priori, so our GP prior mean
is the function 0. The GP model allows us, each time we receive a new sample (i.e. an
arm-reward pair), to use probabilistic reasoning to update our belief of what $f$ may be - it
has to come relatively close to the sample values (we are only off because of the noise) but
at the same time it has to agree with the level of smoothness dictated by the covariance
function - thus creating a posterior belief. In addition to creating a 'statistical picture' of
$f$, encoded in the GP posterior mean, the GP model gives us error bars (the GP posterior
variance). In other terms, it gives us confidence intervals for each value $f(x)$.

2.1.2 Basic notations 



We consider a space of arms $\mathcal{X}$ and a kernel $\kappa$ between elements of $\mathcal{X}$. In our model, the
reward after playing arm $x_t \in \mathcal{X}$ is given by $f(x_t) + \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \sigma^2_{\text{noise}})$ and $f$
is a function drawn once and for all from a Gaussian Process with zero mean and with
covariance function $\kappa$. Arms played up to time $t$ are $x_1, \ldots, x_t$ with rewards $y_1, \ldots, y_t$. The
vector of concatenated reward observations is denoted $\mathbf{y}_t$. The GP posterior at time $t$ after
seeing data $(x_1, y_1), \ldots, (x_t, y_t)$ has mean $\mu_t(x)$ with variance $\sigma_t^2(x)$.
Matrix $C_t$ and vector $\mathbf{k}_t(x)$ are defined as follows:

$$(C_t)_{i,j} = \kappa(x_i, x_j) + \sigma^2_{\text{noise}}\, \delta_{ij}$$
$$(\mathbf{k}_t(x))_i = \kappa(x, x_i)$$

$\mu_t(x)$ and $\sigma_t^2(x)$ are then given by the following equations (see Rasmussen and Williams,
2006, chap. 2):

$$\mu_t(x) = \mathbf{k}_t(x)^T C_t^{-1} \mathbf{y}_t \quad (1)$$
$$\sigma_t^2(x) = \kappa(x, x) - \mathbf{k}_t(x)^T C_t^{-1} \mathbf{k}_t(x) \quad (2)$$
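
Equations (1) and (2) can be evaluated directly for a small number of samples. The sketch below is an unoptimised illustration using only the Python standard library: the names `solve` and `gp_posterior` are hypothetical, and a hand-written Gaussian elimination stands in for the linear algebra routine (such as a Cholesky solver) that one would normally take from a numerical library.

```python
from typing import Callable, List, Sequence, Tuple


def solve(a: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gp_posterior(kernel: Callable, xs: Sequence, ys: Sequence[float],
                 noise_var: float, x) -> Tuple[float, float]:
    """Posterior mean and variance at x, following equations (1) and (2)."""
    # (C_t)_{ij} = kernel(x_i, x_j) + noise_var * delta_ij
    c_t = [[kernel(xi, xj) + (noise_var if i == j else 0.0)
            for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    k_t = [kernel(x, xi) for xi in xs]
    alpha = solve(c_t, ys)   # C_t^{-1} y_t
    v = solve(c_t, k_t)      # C_t^{-1} k_t(x)
    mean = sum(k * a for k, a in zip(k_t, alpha))
    var = kernel(x, x) - sum(k * w for k, w in zip(k_t, v))
    return mean, var
```

Recomputing the solution from scratch for every query point costs cubic time in the number of samples; this is fine for a demonstration but would be replaced by incremental updates in a serious implementation.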

2.1.3 UCB arm selection

The algorithm plays a sequence of arms and aims at optimally balancing exploration and
exploitation. For this, we select arms iteratively by maximising an upper confidence function
$f_t$:

$$x_{t+1} = \operatorname{argmax}_{x \in \mathcal{X}} \left\{ f_t(x) = \mu_t(x) + \sqrt{\beta_t}\, \sigma_t(x) \right\}$$

In Section 2.3.2 we show how we can find the argmax of $f_t$ efficiently, in the tree search
problem.

Interpretation The arm selection problem can be seen as an active learning problem: we
want to learn accurately in regions where the function values seem to be high, and do not
care much if we make inaccurate predictions elsewhere. The $\sqrt{\beta_t}$ term balances exploration
and exploitation: the bigger it gets, the more it favours points with high $\sigma_t(x)$ (exploration),
while if $\sqrt{\beta_t} = 0$, the algorithm is greedy. In the original UCB formula, $\sqrt{\beta_t} \sim \sqrt{\log t}$.

Balance between exploration an d exploitation A choice of yj]3 t corresponds to a 



choice of an upper confidence bound. ISrinivas et al.1 (|2O10h give a regret bound with high 



probability, that relies on the fact that the / values lie between their lower and upper 
confidence bounds. If X is finite, this happens with probability 1 — 5 if: 



However, the constants in their bounds were not optimised, and scaling by a constant 
specific to the problem at hand might be beneficial in practise. In their sensor network 
application, they tune the scaling parameter by cross validation. 
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2.2 Theoretical background 

The GPB algorithm was studied by Srinivas et al. ( 20ld ) in the cases of finite and infinite 



number of arms, under the assumption that the mean-reward function / is drawn from a 
Gaussian Process with zero mean and given covariance function, and in a more agnostic 
setting where / has low complexity as measured under the RKHS norm induced by a given 
kernel. Their work is core to the regret bounds we give in Section [H 

2.2.1 Overview 

Finite case analysis When all $f$ values are within their confidence intervals (which, by
design of the upper confidence bounds, happens with high probability), a relationship can
be given between the regret of the algorithm and its information gain after acquiring $T$
samples (i.e. playing $T$ arms). When everything is Gaussian, the information gain can
easily be written in terms of the eigenvalues of the kernel matrix on the training set of
arms that have been played so far. The simplest case is for a linear kernel in $d$ dimensions.
However, in general there is no simple expression for these eigenvalues since we do not know
which arms have been played (see footnote 14). Thanks to the result of Nemhauser et al. (1978), we can
use the fact that the information gain is a sub-modular function in order to bound our
information gain by the "greedy information gain", which can itself be expressed in terms
of the eigenvalues of the kernel matrix on the whole set of arms (which is known and fixed),
instead of the kernel matrix on the training set. We present this analysis in slightly more
detail in Section 2.3.

Infinite case analysis The analysis requires discretising the input space $\mathcal{X}$ (assuming
it is a subspace of $\mathbb{R}^d$), and we need additional regularity assumptions on the covariance
function in order to have all $f$ values within their confidence intervals with high probability.
The discretisation $\mathcal{X}_T$ is finer at each time step $T$, and the information gain is bounded by
an expression of the eigenvalues of the kernel matrix on $\mathcal{X}_T$. The expected sum of these
can be linked to the sum of eigenvalues of the kernel operator spectrum with respect to the
uniform distribution over $\mathcal{X}$, for which an expression is known for common kernels such as
the Gaussian and Matérn kernels.

2.2.2 Finite number of arms 

We present two main results of Srinivas et al. (2010) that will be needed in Section 4. First,
we show that the regret of UCB-type algorithms can be bounded with high probability
based on a measure of how quickly the function can be learnt in an information theoretic
sense: the maximum possible information gain $I^*(T)$ after $T$ iterations ("max infogain").
Intuitively, a small growth rate means that there is not much information left to be gained
after some time, hence that we can learn quickly, which should result in small regrets. The
max infogain is a problem dependent quantity and its growth is determined by properties
of the kernel and of the input space.

Second, we give an expression of the information gain of the "greedy" algorithm, that
aims to maximise the immediate information gain at each iteration, in terms of the
eigenvalues of the kernel matrix on $\mathcal{X}$. The max infogain is bounded by a constant times the
greedy infogain. We can thus bound the regret in terms of the eigenvalues of the kernel
matrix.

14. The process of selecting arms is non-deterministic because of the noise introduced in the responses.
However, we could maybe determine the probabilities of arms being selected, but such an analysis would
be problem-specific (it would depend on $f$ values): we could, as is done in the UCB proof, look at the
probability to select an arm given the number of times each arm has been selected so far, and do a
recursion, which gives a problem-specific bound in terms of the $\Delta_i$'s.

Notations In the following, $T$ will denote the total number of iterations performed by the
algorithm, $R_T$ the cumulative regret after $T$ iterations, $r_t$ the immediate regret at time step
$t$, $N = |\mathcal{X}|$ the number of arms, $K$ the $N \times N$ kernel matrix on the set of arms, $a_{1 \le i \le N}$ the
feature representation of the $i$-th arm, and $x$ an element of the feature space (which might
not correspond to one of the $a_i$'s). $I_g$ and $I_u$ will denote respectively the information gain
of the greedy algorithm and of the (GP-)UCB algorithm.

Theorem Theorem 1 of Srinivas et al. (2010) uses the fact that GPB always picks the
arm with highest UCB value in order to relate the regret to $I_u(T)$:

$$R_T \le \sqrt{\frac{16}{\log(1 + \sigma^{-2}_{\mathrm{noise}})} \log\left(\frac{N T^2 \pi^2}{6 \delta}\right) T\, I_u(T)} \quad \text{with probability } 1 - \delta$$



Greedy infogain We define the "greedy algorithm" as the algorithm which is allowed to
pick linear combinations of arms in $\mathcal{X}$, with a vector of weights of norm equal to 1, in order
to maximise the immediate information gain at each time step. An arm in this extended
space of linear combinations of the $a_i$'s is not characterised by an index anymore but by
a weight vector. Infogain maximisers are arms that maximise the variance, which is given
at arm $x = \sum_i v_i a_i$ by $v^T \Sigma_t v$ where $\Sigma_t$ is the posterior covariance matrix at time $t$ for
the greedy algorithm and $v$ is a weight vector of norm 1. Let $u_i$ denote the eigenvectors
of eigenvalues $\hat{\lambda}_i$ (in decreasing order) of $K$. It can be shown that $\Sigma_t$ and $K$ share the same
eigenbasis and that the greedy algorithm selects arms such that their weight vectors are
among the $t$ first eigenvectors of $K$. The $\Sigma_t$ eigenvalue for the $i$-th eigenvector $u_i$ of $K$ is
given by:

$$\hat{\lambda}_{i,t} = \frac{\hat{\lambda}_i}{1 + \sigma^{-2}_{\mathrm{noise}} m_{i,t} \hat{\lambda}_i}$$

where $m_{i,t}$ denotes the number of times $u_i$ has been selected up to time $t$ (we say that a
weight vector is selected when the corresponding arm is selected).

Consequently, $u_j$ is selected for the first time at time $t$ if all eigenvectors of $K$ of indices
smaller than $j$ have been selected at least once and:

$$\forall i < j, \quad \frac{\hat{\lambda}_i}{1 + \sigma^{-2}_{\mathrm{noise}} m_{i,t} \hat{\lambda}_i} \le \hat{\lambda}_j \quad \text{(this will be useful in Section 4.4).}$$

An expression of the greedy infogain can be given in terms of the eigenvalues $\hat{\lambda}_{1 \le t \le N}$ of
$K$:

$$I_g(T) = \frac{1}{2} \sum_{t=1}^{\min(T,N)} \log(1 + \sigma^{-2}_{\mathrm{noise}} m_t \hat{\lambda}_t)$$

where $m_t$ denotes the number of times $u_t$ has been selected during the $T$ iterations. We see
that the rate of decay of the eigenvalues has a direct impact on the rate of growth of $I_g$.

Maximum possible infogain An information-theoretic argument for submodular
functions (Nemhauser et al., 1978) gives a relationship between the infogain $I_u(T)$ of the
GP-UCB algorithm at time $T$ and the infogain $I_g(T)$ of the greedy algorithm at time $T$, based
on the constant $e = \exp(1)$:

$$I_u(T) \le I^*(T) \le \frac{1}{1 - e^{-1}} I_g(T)$$

As a consequence:

$$I_u(T) \le I^*(T) \le \frac{1}{2 (1 - e^{-1})} \sum_{t=1}^{\min(T,N)} \log(1 + \sigma^{-2}_{\mathrm{noise}} m_t \hat{\lambda}_t) \qquad (3)$$

It might seem that $I^*(T)$ would always be bounded by a constant because $\min(T, N) \le N$,
which would imply a regret growth in $O(\sqrt{T \log(T)}) = \tilde{O}(\sqrt{T})$. However, as we will
see in Section 5, we may be interested in running the algorithm with a finite horizon $T$ and
letting $N$ depend on $T$. Also, we aim to provide tight bounds for the case where $T < N$,
with improved constants. The growth rate of $I^*(T)$ might become higher, but the tight
constants will result in tighter bounds. Finally, we aim to study how these constants are
improved for smoother kernels.

2.3 Application to tree search 

Let us consider trees of maximum branching factor $B \ge 2$ and depth $D \ge 1$. As announced
in the introduction, our Gaussian Processes Tree Search algorithm (GPTS) considers tree
paths as arms of a bandit problem. The number of arms is $N = B^D$ (number of leaves
or number of tree paths). Therefore, drawing $f$ from a GP is equivalent to drawing an
$N$-dimensional vector of $f$ values $\mathbf{f}$ from a multi-variate Gaussian.

2.3.1 Feature space

A path $\mathbf{x}$ is given by a sequence of nodes: $\mathbf{x} = x_1, \dots, x_{D+1}$ where $x_1$ is always the root
node and has depth 0. We consider the feature space indexed by all the nodes $x$ of the tree
and defined by

$$\phi_x(\mathbf{x}) = \begin{cases} 1 & \text{if } \exists\, 1 \le i \le D+1,\ x = x_i \\ 0 & \text{otherwise.} \end{cases}$$

The dimension of this space is equal to the number of nodes in the tree $N_n = \frac{B^{D+1} - 1}{B - 1}$.

Linear and Gaussian kernels The linear kernel in this space simply counts the number
of nodes in common between two paths: intuitively, the more nodes in common, the closer
the rewards of these paths should be. We could model different levels of smoothness of $f$
by considering a Gaussian kernel in this feature space and adapting the width parameter.

More kernels More generally, we could consider kernel functions characterised by a set
of $\chi_{0 \le d \le D}$ decreasing values in $[0, 1]$ where $\chi_d$ is the value of the kernel product between
two paths that have $d$ nodes not in common. Once the $\chi_{0 \le d \le D}$ are chosen, we can give an
explicit feature representation for this kernel, based on the original feature space: we only
change the components by taking $\sqrt{\chi_{D-i} - \chi_{D-i+1}}$ instead of 1 if a node at depth $i > 0$ is
in the path, and we take $\sqrt{\chi_D}$ at depth 0 (root). Thus, consider 2 paths that differ on $d$
nodes: the first $1 + D - d$ nodes only will be in common, hence the inner product of their
feature vectors will be $\chi_D + \sum_{i=1}^{D-d} (\sqrt{\chi_{D-i} - \chi_{D-i+1}})^2 = \chi_d$, which is equal to the kernel
product between the 2 paths, by definition of $\chi_d$. Note that the kernel is normalised by
imposing $\chi_0 = 1$, which will be required in Section 3.1.
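
The explicit feature representation can be checked numerically. The sketch below is only an illustration under assumed conventions: a path is encoded as a tuple of $D$ branch choices, a node is identified by the prefix of choices leading to it, and the helper names `path_features` and `kernel` are hypothetical.

```python
import math

def path_features(path, D, chi):
    """Sparse feature vector of a path for the kernel defined by chi[0..D].

    `path` is a tuple of D branch choices; each node is keyed by the prefix
    of choices that leads to it (the empty prefix is the root).
    """
    feats = {(): math.sqrt(chi[D])}                       # root, depth 0
    for i in range(1, D + 1):                             # node at depth i
        feats[path[:i]] = math.sqrt(chi[D - i] - chi[D - i + 1])
    return feats

def kernel(p, q, D, chi):
    """Inner product of the feature vectors of paths p and q."""
    fp, fq = path_features(p, D, chi), path_features(q, D, chi)
    return sum(fp[n] * fq[n] for n in fp.keys() & fq.keys())
```

For two paths that differ on $d$ nodes, `kernel` should return `chi[d]` up to rounding, since the sum over shared nodes telescopes.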

2.3.2 Maximisation of f t in the tree 

The difficulty in implementing the GPB algorithm is to find the maximum of the upper
confidence function when the computational cost of an exhaustive search is prohibitive due
to a large number of arms, as is the case for most tree search applications. At time $t$ we look for
the path $\mathbf{x}$ which maximises $f_t(\mathbf{x})$. Here, we can benefit from the tree structure in order
to perform this search in $O(t)$ only. We first define some terminology and then prove this
result.

Terminology A node $x$ is said to be explored if there exists $\mathbf{x}_i$ with $i \le t$ in the training data
such that $\mathbf{x}_i$ contains $x$, and it is said to be unexplored otherwise. A sub-tree is defined here
to be a set of nodes that have the same parent (called the root of the sub-tree), together with
their descendants. A sub-tree is unexplored if no path in the training data goes through
this sub-tree. A maximum unexplored sub-tree is an unexplored sub-tree such that its root belongs to
an $\mathbf{x}_i$ in the training data.

Proof and procedure $f_t(\mathbf{x})$ can be expressed as a function of $\mathbf{k} = \mathbf{k}_t(\mathbf{x})$ instead of a
function of $\mathbf{x}$ (see Equations 1 and 2) and we argue that all paths that go through a given
unexplored sub-tree $S$ will have the same $\mathbf{k}$ value, hence the same $f_t$ value. Let $\mathbf{x} = x_1 \dots x_l \dots x_{D+1}$
be such a path, where $l \ge 1$ is defined such that node $x_l$ has been explored but not $x_j$ for
$j > l$. All $\mathbf{x}$'s that go through $S$ have the same first nodes $x_1, \dots, x_l$, and the other nodes
do not matter in kernel computations since they haven't been visited.

Consequently we just need to evaluate $f_t(\mathbf{x})$ on one randomly chosen path that goes
through the unexplored sub-tree $S$, all other such paths having the same value for $f_t(\mathbf{x})$.
We represent maximum unexplored sub-trees by "dummy nodes" and, similarly to leaf
nodes, we compute and store $f_t$ values for dummy nodes. The number of dummy nodes
in memory is 1 per visited node with unexplored siblings: it is the sub-tree containing the
unexplored siblings and their descendants. There are at most $D + 1$ such nodes per path in
the training data, and there are $t$ paths in the training data, hence the number of dummy
nodes is less than or equal to $(D + 1)t$.

This would mean that the number of nodes (leaf or dummy) to examine in order to
find the maximiser of $f_t$ would be in $O(t)$. The search can be made more efficient than
examining all these nodes one by one: we assign upper confidence values recursively to all
other nodes (non-leaf and non-dummy) by taking the maximum of the upper confidence
values of their children. The maximiser of $f_t$ can thus be found by starting from the root,
selecting the node with highest upper confidence value, and so on until a leaf or a dummy
node is reached. This method of selecting a path is the same as that of UCT and has a
cost of $O(BD)$ only. After playing an arm, we would need to update the upper confidence
values of all leaf nodes and dummy nodes (in $O(t)$), and with this method we would also
need to update the upper confidence values of these nodes' ancestors, adding an extra cost
in $O(t)$.

Pseudo-code Pseudo-code that implements the search for the argmax of $f_t$ in $O(t)$ is
given in Algorithm 1. We sometimes talk about kernel products between leaves and reward
on leaves because paths can be identified by their leaf nodes. Note that with this algorithm,
we might choose the same leaf node more than once unless $\sigma_{\mathrm{noise}} = 0$.

3. Kernel matrix eigenvalues 

For our analysis, we 'expand' the tree by creating extra nodes so that all branches have the 
same branching factor B. This construction is purely theoretical as the algorithm doesn't 
need a representation of the whole tree, nor the expanded tree, in order to run. 

3.1 Recursive block representation of the kernel matrix 

We write $K_{B,D}$ the kernel matrix on all paths through an expanded tree with branching
factor $B$ and depth $D$, and $J_i$ the matrix of ones of dimension $i \times i$. $B$ and $D$ completely
characterise the tree (here, nodes don't have labels) so $K_{B,D}$ is expressed only in terms of
$B$ and $D$. It can be expressed in block matrix form with $K_{B,D-1}$ and $J_{B^{D-1}}$ blocks:

$$K_{B,1} = (\chi_0 - \chi_1) I_B + \chi_1 J_B \qquad (4)$$

$$K_{B,D} = \begin{pmatrix} K_{B,D-1} & \chi_D J_{B^{D-1}} & \cdots & \chi_D J_{B^{D-1}} \\ \chi_D J_{B^{D-1}} & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \chi_D J_{B^{D-1}} \\ \chi_D J_{B^{D-1}} & \cdots & \chi_D J_{B^{D-1}} & K_{B,D-1} \end{pmatrix}$$

where $\chi_d$ is the value of the kernel product between any two paths that have $d$ nodes not
in common.

To see this, one must think of the $(B, D)$-tree as a root pointing to $B$ $(B, D-1)$-trees.
On the 1st diagonal block of $K_{B,D}$ is the kernel matrix for the paths that go through the
first $(B, D-1)$-tree. Because the kernel function is normalised, this stays the same when we
prepend the same nodes (here the new root) to all paths, so it is $K_{B,D-1}$. Similarly, on the
other diagonal blocks we have $K_{B,D-1}$. In order to complete the block matrix representation
of $K_{B,D}$ we just need to know that any two paths that go through different $(B, D-1)$-trees
only have the root in common, and we use the definition of $\chi_D$.

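Let us denote by $I^{(n)}(M)$ and $J^{(n)}(M)$ the matrices of $n$ blocks by $n$ blocks:

$$I^{(n)}(M) = \begin{pmatrix} M & & 0 \\ & \ddots & \\ 0 & & M \end{pmatrix} \quad \text{and} \quad J^{(n)}(M) = \begin{pmatrix} M & \cdots & M \\ \vdots & \ddots & \vdots \\ M & \cdots & M \end{pmatrix}$$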




DORARD AND SHAWE- TAYLOR 



Algorithm 1 GPB for Tree Search
% Initialisation
create root and dummy child d_0
S = {d_0} % set of arms that can be selected
t = 0 % number of iterations
X_t = [] % leaf-nodes in training set
y_t = [] % rewards in training set
K_t = [] % kernel matrix on the elements of the training set
C_t = [] % inverse covariance matrix
% Iterations
repeat
    % Choose a path
    if t == 0 then
        x = d_0
    else
        choose x in S that has highest upper confidence value
    end if
    if x is a dummy node then
        % Random walk
        create sibling x' of x
        if all siblings of x have been created then
            delete x from the tree and remove from S
        end if
        x = x'
        while depth of x is strictly smaller than D do
            create x' child of x and d dummy child of x
            add d to S
            x = x'
        end while
        add x to S % x is the chosen leaf
    end if
    % Get reward and add to training set
    compute the vector of kernel products k between x and the elements of X_t
    append x to X_t
    append reward(x) to y_t
    extend K_t with k % new row and column for x
    C_t = (K_t + \sigma^2_{noise} I_{t+1})^{-1}
    for all node n in S do
        compute the vector of kernel products k between n and the elements of X_t
        compute the upper confidence of n based on k, C_t, y_t and Equations 1 and 2
    end for
    t = t + 1
until stopping criterion is met
% Define output
look for x in X_t that had highest reward value and output the corresponding path

We can then write:

$$K_{B,D} = \chi_D J^{(B)}(J_{B^{D-1}}) - \chi_D I^{(B)}(J_{B^{D-1}}) + I^{(B)}(K_{B,D-1}) \qquad (5)$$

3.2 Eigenvalues



For simplicity in the derivations, we consider here the distinct eigenvalues $\bar{\lambda}_i$ in increasing
order, with multiplicities $\nu_i$. We will later need to "convert" these to the $\hat{\lambda}_{1 \le t \le N}$ notations
used by Srinivas et al. (2010) in order to use their results.

We show by recursion that, for all $D \ge 1$, $K_{B,D}$ has $D + 1$ distinct eigenvalues $\bar{\lambda}_i^{(D)}$ with
multiplicities $\nu_i^{(D)}$:

$$\forall i \in [1, D], \quad \bar{\lambda}_i^{(D)} = \sum_{j=0}^{i-1} B^j (\chi_j - \chi_{j+1}) \quad \text{and} \quad \nu_i^{(D)} = (B-1) B^{D-i} \qquad (6)$$

$$\bar{\lambda}_{D+1}^{(D)} = \sum_{j=0}^{D-1} B^j (\chi_j - \chi_{j+1}) + B^D \chi_D \quad \text{and} \quad \nu_{D+1}^{(D)} = 1 \qquad (7)$$

We also show that $J_{B^D}$ and $K_{B,D}$ share the same eigenbasis, and the eigenvector of $K_{B,D}$ with
highest eigenvalue is the vector of ones $\mathbf{1}_{B^D}$, which is also the eigenvector of $J_{B^D}$ with
highest eigenvalue.

3.2.1 Proof 

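Preliminary result: eigenanalysis of $J_B$ and $J^{(B)}(M)$ $J_B$ has two eigenvalues: 0 with
multiplicity $B - 1$ and $B$ with multiplicity 1. We denote by $\mathbf{j}_1, \dots, \mathbf{j}_B$ the eigenvectors of $J_B$,
in decreasing order of corresponding eigenvalue. $\mathbf{j}_1$ is the vector of ones. The coordinates
of $\mathbf{j}_i$ are denoted $j_{i,1}, \dots, j_{i,B}$. For all $i$ from 1 to $B$ we define $u_i^{(B)}(\cdot)$ as a concatenation of $B$
vectors:

$$u_i^{(B)}(v) = \begin{pmatrix} j_{i,1} v \\ \vdots \\ j_{i,B} v \end{pmatrix}$$

For all $i \ge 2$, $\sum_l j_{i,l} = 0$ by definition of $\mathbf{j}_i$. For all $n$-dimensional vector $v$ and $n \times n$ matrix
$M$, each of the $B$ blocks of $J^{(B)}(M) u_i^{(B)}(v)$ is equal to:

$$\begin{pmatrix} (\sum_k M_{1,k} j_{i,1} v_k) + \dots + (\sum_k M_{1,k} j_{i,B} v_k) \\ \vdots \\ (\sum_k M_{n,k} j_{i,1} v_k) + \dots + (\sum_k M_{n,k} j_{i,B} v_k) \end{pmatrix} = \Big(\sum_l j_{i,l}\Big) M v = 0$$

Hence, for $i \ge 2$, $u_i^{(B)}(v)$ is an eigenvector of $J^{(B)}(M)$ with eigenvalue equal to 0.

Recursion We propose eigenvectors of $K_{B,D}$, use Equation 5 and determine the value of
each term of the sum multiplied by the proposed eigenvectors, in order to get an expression
for the eigenvalues.

• For $D = 1$. From Equation 4, $\mathbf{j}_2, \dots, \mathbf{j}_B$ are also eigenvectors of $K_{B,1}$ with eigenvalue
$\bar{\lambda}_1^{(1)} = \chi_0 - \chi_1$, hence $\bar{\lambda}_1^{(1)}$ has multiplicity $\nu_1^{(1)} = B - 1$ as expected. $\mathbf{j}_1$ is also an
eigenvector of $K_{B,1}$ with eigenvalue $\bar{\lambda}_2^{(1)} = B \chi_1 + \chi_0 - \chi_1$, and $\nu_2^{(1)} = 1$.

• Let us assume the result is true for a given depth $D - 1$.

  - The largest eigenvalue of $K_{B,D-1}$ is

    $$\bar{\lambda}_D^{(D-1)} = B^{D-1} \chi_{D-1} + \sum_{j=0}^{D-2} B^j (\chi_j - \chi_{j+1})$$

    with multiplicity 1. Let us apply $u_1^{(B)}$ to the corresponding eigenvector $\mathbf{1}_{B^{D-1}}$,
    and multiply it by the expression of $K_{B,D}$ given in Equation 5.

    * $u_1^{(B)}(\mathbf{1}_{B^{D-1}}) = \mathbf{1}_{B^D}$ and $J^{(B)}(J_{B^{D-1}})$ is a matrix of ones in $B^D$ dimensions,
      hence:
      $$J^{(B)}(J_{B^{D-1}})\, u_1^{(B)}(\mathbf{1}_{B^{D-1}}) = B^D u_1^{(B)}(\mathbf{1}_{B^{D-1}})$$

    * $\mathbf{1}_{B^{D-1}}$ is also the highest eigenvector of $J_{B^{D-1}}$, with eigenvalue $B^{D-1}$, hence:
      $$I^{(B)}(J_{B^{D-1}})\, u_1^{(B)}(\mathbf{1}_{B^{D-1}}) = B^{D-1} u_1^{(B)}(\mathbf{1}_{B^{D-1}})$$

    * By definition of $\mathbf{1}_{B^{D-1}}$ and $\bar{\lambda}_D^{(D-1)}$:
      $$I^{(B)}(K_{B,D-1})\, u_1^{(B)}(\mathbf{1}_{B^{D-1}}) = \bar{\lambda}_D^{(D-1)} u_1^{(B)}(\mathbf{1}_{B^{D-1}})$$

    As a consequence, $u_1^{(B)}(\mathbf{1}_{B^{D-1}}) = \mathbf{1}_{B^D}$ is the eigenvector of $K_{B,D}$ with highest
    eigenvalue (this will be confirmed later), equal to $\bar{\lambda}_{D+1}^{(D)} = B^D \chi_D + \sum_{j=0}^{D-1} B^j (\chi_j - \chi_{j+1})$.

  - Let us apply $u_k^{(B)}$ to $\mathbf{1}_{B^{D-1}}$ for all $k$ from 2 to $B$ (the $B - 1$ eigenvectors of $J_B$ with eigenvalue 0).

    * Owing to the preliminary result, we have:
      $$J^{(B)}(J_{B^{D-1}})\, u_k^{(B)}(\mathbf{1}_{B^{D-1}}) = 0$$

    * Since $\mathbf{1}_{B^{D-1}}$ is the eigenvector of $J_{B^{D-1}}$ with eigenvalue $B^{D-1}$:
      $$I^{(B)}(J_{B^{D-1}})\, u_k^{(B)}(\mathbf{1}_{B^{D-1}}) = B^{D-1} u_k^{(B)}(\mathbf{1}_{B^{D-1}})$$

    * Since $\mathbf{1}_{B^{D-1}}$ is the eigenvector of $K_{B,D-1}$ with highest eigenvalue:
      $$I^{(B)}(K_{B,D-1})\, u_k^{(B)}(\mathbf{1}_{B^{D-1}}) = \bar{\lambda}_D^{(D-1)} u_k^{(B)}(\mathbf{1}_{B^{D-1}})$$
      for the same reasons as previously.

    As a consequence, $K_{B,D}\, u_k^{(B)}(\mathbf{1}_{B^{D-1}}) = (-\chi_D B^{D-1} + \bar{\lambda}_D^{(D-1)})\, u_k^{(B)}(\mathbf{1}_{B^{D-1}})$ and
    we have found $B - 1$ eigenvectors of $K_{B,D}$ with eigenvalue equal to $\bar{\lambda}_D^{(D)} =
    \sum_{j=0}^{D-1} B^j (\chi_j - \chi_{j+1})$. These vectors are also eigenvectors of $J_{B^D}$ with eigenvalue
    0, which comes from the preliminary result and the fact that $J_{B^D} = J^{(B)}(J_{B^{D-1}})$.

  - For $i$ from 1 to $D - 1$, let us apply $u_k^{(B)}$, for all $k$ from 1 to $B$, to all $(B - 1) B^{D-1-i}$
    eigenvectors $v$ of $K_{B,D-1}$ with eigenvalue equal to $\bar{\lambda}_i^{(D-1)}$. By definition of $v$:
    $$I^{(B)}(K_{B,D-1})\, u_k^{(B)}(v) = \bar{\lambda}_i^{(D-1)} u_k^{(B)}(v)$$
    $v$ being also an eigenvector of $J_{B^{D-1}}$ with eigenvalue 0:
    $$J^{(B)}(J_{B^{D-1}})\, u_k^{(B)}(v) = 0$$
    $$I^{(B)}(J_{B^{D-1}})\, u_k^{(B)}(v) = 0$$

    As a consequence, eigenvalues stay unchanged but their multiplicities are all
    multiplied by $B$ (because $k$ goes from 1 to $B$ and we have identified $B$ times as many
    eigenvectors) which gives $\nu_i^{(D)} = (B - 1) B^{D-i}$. Again, the preliminary result
    allows us to show that the $u_k^{(B)}(v)$ are also eigenvectors of $J_{B^D}$ with eigenvalue 0.

  - The total number of multiplicities for all found eigenvalues is equal to $\left(\sum_{i=1}^{D-1} (B-1) B^{D-i}\right) +
    B - 1 + 1 = B^D$ so we have identified all the eigenvectors.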
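
The eigenvalue formulas can be checked numerically on a small tree. The following NumPy sketch is an illustration only: it assumes `chi` is a list holding $\chi_0, \dots, \chi_D$, builds $K_{B,D}$ from the block recursion of Equations 4 and 5 using Kronecker products, and compares its spectrum with Equations 6 and 7. The function names are hypothetical.

```python
import numpy as np

def kernel_matrix(B, D, chi):
    """K_{B,D} built with the block recursion of Equations 4 and 5."""
    K = (chi[0] - chi[1]) * np.eye(B) + chi[1] * np.ones((B, B))   # K_{B,1}
    for depth in range(2, D + 1):
        n = K.shape[0]                                             # B^(depth-1)
        J = np.ones((n, n))
        K = (chi[depth] * np.kron(np.ones((B, B)), J)              # chi J^(B)(J)
             - chi[depth] * np.kron(np.eye(B), J)                  # chi I^(B)(J)
             + np.kron(np.eye(B), K))                              # I^(B)(K)
    return K

def predicted_spectrum(B, D, chi):
    """List of (eigenvalue, multiplicity) pairs from Equations 6 and 7."""
    out = []
    for i in range(1, D + 1):
        lam = sum(B**j * (chi[j] - chi[j + 1]) for j in range(i))
        out.append((lam, (B - 1) * B**(D - i)))
    top = sum(B**j * (chi[j] - chi[j + 1]) for j in range(D)) + B**D * chi[D]
    out.append((top, 1))
    return out
```

Comparing `np.linalg.eigvalsh(kernel_matrix(B, D, chi))` with the sorted, repeated values from `predicted_spectrum` gives a quick sanity check for any decreasing sequence `chi` with `chi[0] = 1`.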

3.2.2 Re-ordering of the kernel matrix eigenvalues 

In order to match the notations of Srinivas et al. (2010) we re-write the eigenvalues as a
sequence $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \dots \ge \hat{\lambda}_N$. We first need to reverse the order of the eigenvalues and
thus consider the sequence of $\bar{\lambda}_{D-i}$'s. We obtain the lambda hats by repeating the lambda
bars as many times as their multiplicities: $\hat{\lambda}_t = \bar{\lambda}_{D-i}$ with $i$ such that $B^i < t \le B^{i+1}$. For
$1 < t \le N$, $\log(t) = i \log(B) + r$ with $0 < r \le \log(B)$, hence $B^i < t \le B^{i+1}$ and

$$i = \frac{\log(t) - r}{\log(B)}$$

from which we have:

$$\forall t \in [1, N],\ \exists i \in [-1, D-1],\quad \hat{\lambda}_t = \bar{\lambda}_{D-i} \ \text{ with } \ \log_B(t) - 1 \le i < \log_B(t) \qquad (8)$$

3.3 Linear kernel 

The linear kernel is an inner product in the feature space, which amounts to counting how
many nodes in common two paths have. It takes values from 1 to $D + 1$. The normalised
linear kernel divides these values by $D + 1$. If two paths of depth $D$ differ on $d$ nodes, they
have $D + 1 - d$ nodes in common:

$$\chi_d = \frac{D + 1 - d}{D + 1}$$

For all $j$, $\chi_j - \chi_{j+1} = \frac{1}{D+1}$, hence $\bar{\lambda}_i = \sum_{j=0}^{i-1} \frac{B^j}{D+1} = \frac{B^i - 1}{(B-1)(D+1)}$ for $i < D + 1$.
We use Inequality 8 to get a lower and an upper bound on $\hat{\lambda}_t$ for $t > 1$:

$$\frac{N B^{-\log_B(t)} - 1}{(B-1)(D+1)} \le \hat{\lambda}_t \le \frac{N B^{1 - \log_B(t)} - 1}{(B-1)(D+1)}$$

$$\forall t > 1, \quad \frac{N - t}{(B-1)(D+1)\, t} \le \hat{\lambda}_t \le \frac{N B - t}{(B-1)(D+1)\, t}$$

The bounds for $\hat{\lambda}_1$ are obtained by adding $B^D \chi_{D+1}$ to the bounds above. Indeed:

$$\hat{\lambda}_1 = \bar{\lambda}_{D+1} = \sum_{j=0}^{D-1} B^j (\chi_j - \chi_{j+1}) + B^D \chi_D = \sum_{j=0}^{D} B^j (\chi_j - \chi_{j+1}) + B^D \chi_{D+1}$$

We thus see that the expression for $\bar{\lambda}_{D+1}$ only differs from the expressions for other $\bar{\lambda}_i$'s by
an added $B^{i-1} \chi_i$ term.
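
To make the re-ordering of Inequality 8 and the resulting $1/t$ bounds concrete, here is a small pure-Python sketch for the normalised linear kernel. The function name `linear_kernel_eigs` is an assumption of this example; entry `hat[t - 1]` of the returned list plays the role of $\hat{\lambda}_t$.

```python
def linear_kernel_eigs(B, D):
    """Eigenvalues of the normalised linear kernel matrix, in decreasing order."""
    # Distinct eigenvalues bar_lambda_i for i = 1..D (closed form of the geometric sum).
    bar = [(B**i - 1) / ((B - 1) * (D + 1)) for i in range(1, D + 1)]
    # Largest eigenvalue: Equation 7 with chi_D = 1 / (D + 1).
    top = bar[-1] + B**D / (D + 1)
    hat = [top]
    # Repeat each bar_lambda_i as many times as its multiplicity (B - 1) B^(D - i),
    # starting from the largest index so that the list is decreasing.
    for i in range(D, 0, -1):
        hat += [bar[i - 1]] * ((B - 1) * B**(D - i))
    return hat
```

For every $1 < t \le N$ the value `hat[t - 1]` should lie between $(N - t)/((B-1)(D+1)t)$ and $(NB - t)/((B-1)(D+1)t)$, and the eigenvalues should sum to $N$, the trace of the normalised kernel matrix.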



3.4 Gaussian kernel 

We give an expression for $\bar{\lambda}_i$ for this kernel, before giving bounds on $\hat{\lambda}_t$ and studying the
influence of the kernel width $s$ on these bounds.

3.4.1 Value of $\chi_d$ and $\bar{\lambda}_i$

The squared Euclidean distance in the paths feature space is twice the number of nodes $d$
where they differ: path 1 contains nodes indexed by $i_1 \dots i_d$ that path 2 doesn't contain, and
path 2 contains nodes indexed by $j_1 \dots j_d$ that path 1 doesn't contain, so the $i_1 \dots i_d$ and $j_1 \dots j_d$
components of the feature vectors differ. The components of the difference of the feature
vectors will be 0 except at the $d$ $i$-indices and at the $d$ $j$-indices where they will be 1 or $-1$.
Summing the squares gives $2d$.

Consequently, the Gaussian kernel is an exponential on minus the number of nodes
where paths differ (from 0 to $D$):

$$\chi_d = \exp\left(-\frac{d}{s^2}\right)$$

For all $j$, $\chi_j - \chi_{j+1} = (1 - \exp(-\frac{1}{s^2})) \exp(-\frac{j}{s^2})$, hence for all $i < D + 1$,

$$\bar{\lambda}_i = \left(1 - \exp\left(-\frac{1}{s^2}\right)\right) \sum_{j=0}^{i-1} \left(B \exp\left(-\frac{1}{s^2}\right)\right)^j = C_s (q_s^i - 1)$$

where

$$q_s = B \exp\left(-\frac{1}{s^2}\right) \quad \text{and} \quad C_s = \frac{1 - \frac{q_s}{B}}{q_s - 1}$$

By definition, $q_s < B$. Let us focus on the case where $1 < q_s$ so that $C_s$ is always positive,
which is equivalent to:

$$s > \frac{1}{\sqrt{\log(B)}}$$

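3.4.2 Bounds on $\hat{\lambda}_t$

Once again, Inequality 8 gives us a lower and an upper bound on $\hat{\lambda}_t$:

$$C_s (q_s^D q_s^{-\log_B(t)} - 1) \le \hat{\lambda}_t \le C_s (q_s^D q_s^{1 - \log_B(t)} - 1)$$

As we will see in the next section, in Inequality 3 we are only interested in $t$ indices
that are smaller than $N$. As for the linear kernel, we can bound $\hat{\lambda}_t$ by expressions in $1/t$.
Indeed, $q_s^{-\log_B(t)} = t^{-\log_B(q_s)} = \frac{1}{t}\, t^{\frac{1}{s^2 \log(B)}}$ and, for $1 \le t \le N = B^D$:

$$1 \le t^{\frac{1}{s^2 \log(B)}} \le \exp\left(\frac{D}{s^2}\right), \quad \text{hence} \quad \frac{1}{t} \le q_s^{-\log_B(t)} \le \frac{1}{t} \exp\left(\frac{D}{s^2}\right)$$

Since $q_s^D = N \exp(-\frac{D}{s^2})$, this gives:

$$\forall t > 1, \quad \frac{C_s (N \exp(-\frac{D}{s^2}) - t)}{t} \le \hat{\lambda}_t \le \frac{C_s (N q_s - t)}{t} \quad \text{for } s > \frac{1}{\sqrt{\log(B)}}$$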
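
The same kind of numerical check can be made for the Gaussian kernel. The sketch below is a hedged illustration, assuming $s > 1/\sqrt{\log(B)}$ so that $q_s > 1$; the function name `gaussian_kernel_eigs` and its return convention are hypothetical.

```python
import math

def gaussian_kernel_eigs(B, D, s):
    """Eigenvalues of the Gaussian kernel matrix on tree paths, in decreasing order.

    Assumes s > 1 / sqrt(log(B)), i.e. q_s > 1. Returns (hat, q_s, C_s).
    """
    q = B * math.exp(-1 / s**2)                  # q_s
    C = (1 - q / B) / (q - 1)                    # C_s
    bar = [C * (q**i - 1) for i in range(1, D + 1)]
    top = bar[-1] + B**D * math.exp(-D / s**2)   # Equation 7 with chi_D = exp(-D / s^2)
    hat = [top]
    for i in range(D, 0, -1):                    # largest distinct eigenvalues first
        hat += [bar[i - 1]] * ((B - 1) * B**(D - i))
    return hat, q, C
```

Each `hat[t - 1]` with $1 < t \le N$ can then be compared with the lower bound $C_s (N \exp(-D/s^2) - t)/t$ and the upper bound $C_s (N q_s - t)/t$.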



3.4.3 Influence of the kernel width

From the above we have:

$$\hat{\lambda}_t \le \frac{N C_s q_s}{t}$$

Note that

$$C_s q_s = \frac{(B - q_s)\, q_s}{B (q_s - 1)} = \left(1 + \frac{1}{q_s - 1}\right) \left(1 - \frac{q_s}{B}\right)$$

and $q_s$ increases when $s$ increases, hence $\frac{1}{q_s - 1}$ decreases and $1 - \frac{q_s}{B}$ decreases. As a result,
$C_s q_s$ decreases. Also, since $q_s$ tends to $B$ when $s$ tends to infinity, the limit of $C_s q_s$ is 0
when $s$ tends to infinity. The $\hat{\lambda}_t$ upper-bound improves over that of the linear kernel when
$s$ is big enough so that $C_s q_s < \frac{B}{(B-1)(D+1)}$.

Now, let us look at the rate at which $C_s q_s$ tends to zero: when $s$ is bigger than
$\frac{1}{\sqrt{\log(B/2)}}$ (so that $q_s \ge 2$) we have:

$$C_s q_s \le 2 \left(1 - \exp\left(-\frac{1}{s^2}\right)\right) \le 2 \left(\frac{1}{s^2} + o\left(\frac{1}{s^2}\right)\right)$$

Hence:

$$C_s q_s = O\left(\frac{1}{s^2}\right)$$
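
The decay of $C_s q_s$ with the kernel width can be observed directly. This is a minimal sketch, assuming $q_s > 1$; `cs_qs` is a hypothetical helper name.

```python
import math

def cs_qs(B, s):
    """Constant C_s * q_s of the Gaussian-kernel eigenvalue bound (requires q_s > 1)."""
    q = B * math.exp(-1 / s**2)      # q_s
    c = (1 - q / B) / (q - 1)        # C_s
    return c * q
```

Whenever $q_s \ge 2$, the factor $q_s / (q_s - 1)$ is at most 2 and $1 - \exp(-1/s^2) \le 1/s^2$, so `cs_qs(B, s)` should stay below $2/s^2$ and decrease as `s` grows.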



4. GP Tree Search bounds 

Information gain bounds can be turned into high probability regret bounds as seen in Section
2. In Section 4.1 we give two information gain bounds which are valid for any kernel with $\chi$
values in $[0, 1]$. Bound 10 is independent of time and is interesting asymptotically. However,
in most interesting tree search problems, $N$ is extremely large (consider $B = 200$ as in Go,
and $D = 10$) and the number of iterations $T$ is smaller than $N$. Bound 9 is log linear in $T$
but only involves constants that are small compared to $N$ (in $O(BD)$), which is interesting
when $T < N$.

In Section 4.2 we derive better bounds that take advantage of the decay of the kernel
matrix eigenvalues. Bounds 15 and 16, when combined, give a function which is linear up
to a time $T_*$, after which it becomes logarithmic in the number of iterations. For $T > N$
the bound becomes independent of time. We show that, in the case of the Gaussian kernel,
the constants improve in $O(\frac{1}{s^2})$ when the kernel width $s$ increases. The Gaussian kernel
bound can be better than the linear kernel bound for large enough values of $s$.



4.1 Kernel-independent bound 

We run the algorithm until time $T$. We could use the feature representation described in the
previous section, in which case the dimension would be bounded by the number of visited
nodes which is itself bounded by $DT$. Alternatively, we could consider feature
representations which are indicator vectors, in which case the dimension would be bounded by $T$ (we
have seen $T$ different arms at most). Doing so equates to changing the kernel matrix into
the identity matrix, which can only worsen the regret since arms become independent. It
can be preferable to switch to the feature representation described in the previous section
for large values of $T$, since the number of visited nodes is bounded by $N_n$.

The algorithm runs with a linear kernel in this feature space, or, if we consider the
Gaussian kernel, a slightly different feature space with the same dimensionality (as seen in the
previous section). This corresponds to case 1 of Theorem 5 from Srinivas et al. (2010). For
a linear kernel, $\kappa(x, x') = x^T x'$ and the kernel matrix $K_T$ on training set $[x_1 \dots x_T]$ is equal
to $X_T^T X_T$. Let us denote by $\Lambda$ the diagonal matrix of eigenvalues $\lambda_1 \ge \dots \ge \lambda_{dim}$ of $X_T X_T^T$.
The information gained from a training set $A_T$ can be expressed in terms of $K_T$:

$$F(A_T) = H(\mathbf{y}_T) - H(\mathbf{y}_T | \mathbf{f}_T)$$
$$= H(\mathcal{N}(0, K_T + \sigma^2_{\mathrm{noise}} I_T)) - H(\mathcal{N}(\mathbf{f}_T, \sigma^2_{\mathrm{noise}} I_T))$$
$$= \frac{1}{2} \log(|I_T + \sigma^{-2}_{\mathrm{noise}} X_T^T X_T|)$$
$$= \frac{1}{2} \log(|I_{dim} + \sigma^{-2}_{\mathrm{noise}} X_T X_T^T|) \quad \text{by Sylvester's determinant theorem}$$
$$\le \frac{1}{2} \log(|I_{dim} + \sigma^{-2}_{\mathrm{noise}} \Lambda|) \quad \text{by Hadamard's inequality}$$
$$\le \sum_{i=1}^{dim} \frac{1}{2} \log(1 + \sigma^{-2}_{\mathrm{noise}} \lambda_i)$$
$$\le \frac{dim}{2} \log(1 + \sigma^{-2}_{\mathrm{noise}}\, dim)$$

This results from the fact that $X_T X_T^T$ is a $dim \times dim$ real symmetric matrix with entries in
$[0, 1]$, hence every eigenvalue is smaller than $\lambda_1$ which itself is smaller than $dim$ (see Zhan,
2005). Using $dim \le T$ and maximising over $A_T$, we get:

$$I^*(T) \le \frac{T}{2} \log(1 + \sigma^{-2}_{\mathrm{noise}} T) \qquad (9)$$

We could have a bound in $O(\log(T))$ instead by bounding the first occurrence of $dim$ by
$N_n$, but the constant would be huge ($O(N)$) and would make the result less interesting for
$T < N$. We could even have a bound independent of time when bounding both occurrences
of $dim$ by $N_n$:

$$I^*(T) \le \frac{N_n}{2} \log(1 + \sigma^{-2}_{\mathrm{noise}} N_n) \qquad (10)$$

These bounds do not depend on the $\chi_d$ values and are therefore kernel-independent.
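
The quantities in this derivation are easy to compute on a toy example. The sketch below is an illustration under stated assumptions: paths are tuples of $D$ branch choices, features are node indicators scaled by $1/\sqrt{D+1}$ (the normalised linear kernel), and the helper names are hypothetical. It evaluates the information gain through the log-determinant and compares it with Bound 9.

```python
import numpy as np

def path_feature_matrix(paths, D):
    """Columns are normalised node-indicator vectors of the given paths."""
    # Only the nodes visited by the given paths get a row (prefix = node identity).
    nodes = sorted({p[:i] for p in paths for i in range(D + 1)})
    index = {n: r for r, n in enumerate(nodes)}
    X = np.zeros((len(nodes), len(paths)))
    for c, p in enumerate(paths):
        for i in range(D + 1):
            X[index[p[:i]], c] = 1 / np.sqrt(D + 1)
    return X

def information_gain(X, noise_var):
    """F(A_T) = 1/2 log det(I_T + X^T X / noise_var)."""
    T = X.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(T) + X.T @ X / noise_var)
    return 0.5 * logdet

def bound_9(T, noise_var):
    """Right-hand side of Bound 9."""
    return 0.5 * T * np.log(1 + T / noise_var)
```

By Sylvester's determinant theorem, the same gain is obtained from the `dim x dim` matrix `X @ X.T`, which is cheaper when the number of visited nodes is smaller than the number of samples.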
4.2 Sum of log-eigenvalues bound 

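We start from Inequality 3. Without further knowledge on the $m_t$'s, we simply lower-bound
them by 0 and upper-bound them by $T$.

In order to exploit our upper-bound on $\hat{\lambda}_t$, we first bound the sum of $\log(1 + \sigma^{-2}_{\mathrm{noise}} m_t \hat{\lambda}_t)$
by a sum of $\log(c/t) = \log(c) + \log(1/t)$ so that a sum of $\log(1/t)$ appears. This can in turn
be bounded owing to a result on the sum of $\log(1/t)$:

$$\sum_{t=2}^{\min(T,N)} \log(1/t) \le \sum_{t=1}^{T} \log(1/t) \qquad (11)$$
$$\le \log(1/T!) \qquad (12)$$
$$\le -\log(\Gamma(T + 1)) \qquad (13)$$
$$\le -T \log\left(\frac{T + 1}{e}\right) \qquad (14)$$

using the fact that $\Gamma(x) \ge (\frac{x}{e})^{x-1}$.

Let us consider the Gaussian kernel (see footnote 15). $\hat{\lambda}_T$ being the smallest eigenvalue and $m_t$ being
either zero or bigger than 1, we have either $\log(1 + \sigma^{-2}_{\mathrm{noise}} m_t \hat{\lambda}_t) = 0$ or $m_t \hat{\lambda}_t / \hat{\lambda}_T \ge 1$. From there:

$$\log(1 + \sigma^{-2}_{\mathrm{noise}} m_t \hat{\lambda}_t) \le \log\left(\left(\frac{1}{\hat{\lambda}_T} + \sigma^{-2}_{\mathrm{noise}}\right) m_t \hat{\lambda}_t\right) \quad \text{for } t \text{ s.t. } m_t \ne 0$$
$$\le \log\left(\left(\frac{1}{\hat{\lambda}_T} + \sigma^{-2}_{\mathrm{noise}}\right) \frac{T N C_s q_s}{t}\right) \quad \text{for all } t$$

$$\sum_{t=1}^{\min(T,N)} \log(1 + \sigma^{-2}_{\mathrm{noise}} m_t \hat{\lambda}_t) \le T \log\left(\left(\frac{1}{\hat{\lambda}_T} + \sigma^{-2}_{\mathrm{noise}}\right) T N C_s q_s\right) + T \log\left(\frac{e}{T + 1}\right)$$

$$I^*(T) \le \frac{1}{2 (1 - e^{-1})} \left[ T \log\left(\left(\frac{1}{\hat{\lambda}_T} + \sigma^{-2}_{\mathrm{noise}}\right) N C_s q_s e\right) + T \log\left(\frac{T}{T + 1}\right) \right] \qquad (15)$$

By extracting $\log(\frac{1}{t})$ terms from $I^*(T)$, we take advantage of the log but we also
introduce a $\frac{1}{\hat{\lambda}_T}$ term and thus we suffer from smooth kernels for which $\hat{\lambda}_T$ will be low.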



4.3 Eigenvalues tail-sum bound 

To remedy this, we split the sum at such that the lt>T t are iOW an d it is acceptable to 
bound log(l + cr~ oise mtlt) by cy~ oise mtlt- Thus, we consider the tail-sum of the eigenvalues 
which allows us to exploit quick decay rates for smooth kernels, resulting in small regret 
bounds. 

h < NL(t) where L is a decreasing function, so that we can bound the tail-sum 
Ylt=T*+i h by N times a tail integral of L . The quicker L decreases, the lower the integral, 
hence the lower the information gain bound and the lower the regret. For the Gaussian 
kernel and s > 



log(B) • 



min(T,A0 fmin(T,N) 

^2 k<N I L(t)dt 



t=T, 



I*(T)-I*(T*) < 



N 



2(1 - e -i) 



C s q s log 



min(T, N), 



(16) 



15. The derivations for the linear kernel are very similar and just involve different constants. 
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4.4 Sum to T' where Vt > T', m t = 

We know that the greedy procedure chooses eigenvectors of K among the T that have 
highest associated eigenvalue. However, we might have only picked T" eigenvectors and 
picked several times the same ones (rrii gives the number of times we have picked the i th 
biggest eigenvector of K). We look for the smallest T' such that: 

VT' < t < T, m t = (17) 



t ■ 



The information gain will then be bounded by Ylt=i ^°s(l + °"noise m t^ 

The contrary of Proposition (17) is equivalent to choosing $u_{T'+1}$ at least once. This is
equivalent to the fact that there exists $t$, the first time we select $u_{T'+1}$, such that all eigenvalues
$l_{i,t}$ of $\Sigma_t$ are smaller than $l_{T'+1,t} = l_{T'+1}$. This can be written:

$$\exists\, t \le T, \ \forall\, i \le T', \quad \frac{l_i}{1 + \sigma_{noise}^{-2} m_{i,t} l_i} \le l_{T'+1} \qquad (18)$$

$$\frac{1}{l_{T'+1}} - \frac{1}{l_i} \le \sigma_{noise}^{-2} m_{i,t} \qquad (19)$$

Therefore, the negation of Proposition (17) is equivalent to Proposition (19). Let us assume that the
latter is true. We know that each $m_{i,t}$ is smaller than $m_{i,T}$ and that $\sum_{i=1}^{T'} m_{i,T} \le T$, hence:

$$\sum_{i=1}^{T'} \left( \frac{1}{l_{T'+1}} - \frac{1}{l_i} \right) \le \sigma_{noise}^{-2} T \qquad (20)$$

Thus, we can find $T'$ such that Proposition (17) is true by lower bounding $\sum_{i=1}^{T'} \left( \frac{1}{l_{T'+1}} - \frac{1}{l_i} \right)$
and looking for $T'$ such that this lower bound is equal to $\sigma_{noise}^{-2} T$.
From the $l_t$ bounds established in the previous section, we have:

rpl rpl 

B^-i) > ^ r + 1 

i=i 



h>+i k ~ tt N C s q s C s (NeM-^)-i) 



where Ac 



> -r-T'(T' + 1) 

NC s q s (exp(-g) 



exp(-4) - 1 



thus we look for $0 < T' < T$ such that $T'^2 + T' - \sigma_{noise}^{-2} A_s T = 0$:

$$T' = \frac{-1 + \sqrt{1 + 4 \sigma_{noise}^{-2} A_s T}}{2} = O(\sqrt{T})$$

For $T$ big enough, the previous expression is smaller than $T$. Also, since $A_s$ tends to 0 when
$s$ tends to infinity, we know that the bigger $s$, the smaller the constant in front of $\sqrt{T}$ in
the expression for $T'$, and thus the earlier $T' < T$ is true.

When replacing T by T' in the log of the info gain bound, we can divide the constant in 
front of the log by 2 while slightly increasing the offset, but we cannot improve the growth 
rate in T. 
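As a small numerical illustration of this section, the positive root of $T'^2 + T' - \sigma_{noise}^{-2} A_s T = 0$ can be computed directly. This is a sketch only: the function name `t_prime` and the example values of `A_s` and `sigma_noise` are hypothetical, chosen for illustration rather than derived from a particular kernel.

```python
import math

def t_prime(T, sigma_noise, A_s):
    """Positive root of T'^2 + T' - A_s * T / sigma_noise^2 = 0."""
    c = A_s * T / sigma_noise ** 2
    return (-1.0 + math.sqrt(1.0 + 4.0 * c)) / 2.0

# Hypothetical values: A_s shrinks for smoother kernels (larger s).
tp_rough = t_prime(T=10000, sigma_noise=1.0, A_s=0.5)
tp_smooth = t_prime(T=10000, sigma_noise=1.0, A_s=0.05)
```

A smaller `A_s` gives a smaller `T'`, and `T'` grows like the square root of `T`, in line with the discussion above.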

4.5 Final bound

Increasing T* will decrease the eigenvalues tail-sum bound, but it will increase the sum of 
log-eigenvalues bound. Conversely, decreasing T* increases the former bound and decreases 
the latter. We know we picked the optimal T* when the cost to put T* + 1 in the first sum 
would be higher than to put it in the tail sum (regardless of the value of T): 

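$$\log\left(\left(\tfrac{1}{l_{T_*+1}} + \sigma_{noise}^{-2}\right) N C_s q_s e\right)(T_* + 1) - N C_s q_s \log(T_* + 1)$$
$$> \log\left(\left(\tfrac{1}{l_{T_*}} + \sigma_{noise}^{-2}\right) N C_s q_s e\right) T_* - N C_s q_s \log(T_*)$$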



$T_*$ doesn't depend on $T$ but on the kernel: the smoother the kernel, the smaller $T_*$.
For $T \le T_*$, we will only be using the sum of log-eigenvalues bound, resulting in a linear
information gain bound. For $T > T_*$ we will combine the two bounds. The first bound
becomes a constant when replacing $T$ by $T_*$, and the second bound dictates the rate of
growth of $I^*(T)$: $\log(T)$ when $T < N$ and constant otherwise. Thus, the regret is in
$O(T\sqrt{\log(T)})$ or in $O(\log(T)\sqrt{T})$ or in $O(\sqrt{T \log(T)})$. The last two growth rates can be
rewritten as $O(\sqrt{T})$ up to a logarithmic factor.



4.5.1 Influence of s 

The regret should improve for smoother kernels, i.e. bigger s. Let us check that this is the 
case with the bounds we gave. As we already noted, C s q s decreases in O(jj). Hence, the 
eigenvalues tail-sum bound is clearly improving for larger values of s. Let us now derive a 
sum of log-eigenvalues bound in terms of s: 

T 1 
£log(l + a-2 se m t Z t ) < \og(( r + a-l e )NC s q s e)T 

It 



t=i 



NC s q s eT _ 2 
~ lOS{ C s (NeM-$)-T) +CT ^ NCsqse)T 



Sexpf-^eT 



BeT 

< log( . 7TT + ^noL^^)T 

exp( — p-) - exp(^) 



The first term of the sum inside the log is decreasing as exp(— ^p^) and — exp(^-) are 
increasing, and the second term of the sum is also decreasing as C s q s is decreasing. 



5. Discussion 

In this section, we discuss the GPTS algorithm and our previous results in relation to other 
algorithms for tree search and planning in MDPs. We then give a few ideas to extend our 
work and conclude this paper. 

5.1 Tree Search

We compare GPTS to the Bandit Algorithm for Smooth Trees (BAST).

5.1.1 Extension of $f$ to non-leaves

Coquelin and Munos (2007) extend the definition of $f$ to all nodes: let us call $\tilde{f}$ the function
which coincides with $f$ on the leaf nodes and which, on any other node $n$, is equal to the
maximum value of $f$ on tree paths that go through $n$. The maximum value of $\tilde{f}$ is $\tilde{f}^* = f^*$.
We can also extend the definition of the suboptimality $\Delta_n = f^* - f(n)$ of a leaf node to any
node $n$ at any depth: $\Delta_n = f^* - \tilde{f}(n)$. The BAST smoothness assumption on $f$ is that,
for any $\eta$-suboptimal node $n$ at depth $d$ (meaning $\Delta_n \le \eta$), there exists $\delta_d > 0$ such that
$\tilde{f}(n) - \tilde{f}(i) \le \delta_d$ for any child $i$ of $n$. This is only a local regularity assumption as there is
no assumption on nodes which are not $\eta$-suboptimal.

5.1.2 Smoothness of $f$

For two given nodes $n_1$ and $n_2$ with the same parent, there exist two leaves $l_1$ and $l_2$ (with
ancestors $n_1$ and $n_2$, respectively) such that $\tilde{f}(n_1) - \tilde{f}(n_2) = f(l_1) - f(l_2)$. With the GPB
assumption, $(f(l_1), f(l_2))^T$ lies with high probability within an ellipse determined by the
kernel between $l_1$ and $l_2$ (equal to the depth of $n_1$ and $n_2$, when considering the linear
kernel). One can thus say how close the $f$ values of two siblings may be, and thus bound
$\tilde{f}(n) - \tilde{f}(i)$ in terms of the depth of $n$, with high probability, in order to give a rough
comparison with the BAST smoothness assumption. Although this bound holds only with high
probability, while it would always hold with BAST, GPB makes an extra assumption on
how the $f$ values would be distributed.
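The high-probability closeness of two leaf values can be made concrete. A minimal sketch, assuming a zero-mean GP prior so that $f(l_1) - f(l_2)$ is Gaussian with variance $k(l_1, l_1) + k(l_2, l_2) - 2 k(l_1, l_2)$; the function name `sibling_gap_bound` and the kernel values in the usage line are hypothetical.

```python
import math
from statistics import NormalDist

def sibling_gap_bound(k11, k22, k12, delta=0.05):
    """Bound b such that |f(l1) - f(l2)| <= b with prior probability 1 - delta.

    Assumes a zero-mean GP prior, so the difference is N(0, k11 + k22 - 2*k12).
    """
    variance = max(k11 + k22 - 2.0 * k12, 0.0)
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)  # two-sided Gaussian quantile
    return z * math.sqrt(variance)

# Hypothetical kernel values: two leaves with prior variance 5 and covariance 3.
gap = sibling_gap_bound(k11=5.0, k22=5.0, k12=3.0, delta=0.05)
```

The larger the kernel value between the two leaves relative to their prior variances, the tighter the bound, which is what allows a rough comparison with the $\delta_d$ of the BAST assumption.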

5.1.3 Reward variability

BAST assumes that the reward at each leaf is always in $[0, 1]$ and is given by a probability
distribution with mean equal to the $f$ value at that leaf, whereas GPB assumes that the
reward distribution is Gaussian with standard deviation $\sigma_{noise}$. However, Srinivas et al.
have also extended their regret analysis to the more general case where the rewards
are given by $f(x_t) + \epsilon_t$ such that the sequence of noise variables $\epsilon_t$ is an arbitrary martingale
difference sequence uniformly bounded by $\sigma_{noise}$. The resulting regret has the same growth rate
in $T$.

5.1.4 Regret bounds

Theorem 4 of Coquelin and Munos (2007) gives a regret bound when $\delta_d$ decreases exponentially:
$\delta_d = \delta \gamma^d$. The bound is written in terms of the parameters of the smoothness
assumption (namely $\eta$, $\delta$, $\gamma$) and is independent of time. However, this bound is
problem-specific as it involves the inverse of the $\Delta_{min}$ quantity, where $\Delta_{min} = \min_i \{\Delta_i = f^* - f(i)\}$.
Note that when $f$ has $B^D$ possible inputs, $1/\Delta_{min}$ can easily be of the order of $B^D$.

While the BAST bound may be interesting asymptotically, the number of iterations $T$
of the tree search algorithm is unlikely to go past $B^D$ for most interesting values of $B$ and
$D$. An issue with the $1/\Delta_{min}$ term is that the smoother $f$, the bigger $1/\Delta_{min}$ and the
bigger the regret, whereas we would actually like to take advantage of the smoothness of
$f$ to improve (decrease) the regret. The non-dependency w.r.t. $\Delta_{min}$ usually comes at the
price of a stronger dependency on time $T$, as is the case with UCB (see footnote 16).

5.1.5 Tree growing method 

Iterative-deepening. The trees we set out to search are usually too big to be represented in
memory, which is why we "grow" them iteratively by only adding the nodes that are needed
for the implementation of our algorithm. One method of growing the tree is by
iterative-deepening, used for Go tree search by Coulom (2006): the current iteration is stopped after
creating a new node (or reaching a maximum depth); a reward is obtained as a function
of the visited path (not necessarily of depth D), or as a function of a randomly completed 
path of length D. The resulting tree is asymmetric and contains paths that have different 
numbers of nodes. Hopefully this helps to go deeper in the tree in regions where / has 
high values, and keeps the paths short in the rest of the tree. This saves time and memory 
by stopping the exploration at a depth smaller than D and not creating nodes that would 
belong to sub-optimal paths. 

Depth-first. Because we consider tree paths as arms of a bandit problem in GPTS, we
need all paths to have the same length, which results in a depth-first tree growing method.
Depth-first means that, at each iteration, we add nodes sequentially until reaching a max- 
imum depth D, and then we start another iteration of the tree search algorithm from the 
root. So, unlike GPTS, BAST can be run in either iterative-deepening or depth-first mode. 
Supposedly the algorithm is more efficient in its iterative-deepening version, but no regret 
bound was given for this version. 

5.2 Open loop planning in MDPs 

We compare GPTS to the Open Loop Optimistic Planning (OLOP) algorithm. In Tree Search
applied to planning in MDPs, the reward is a sum of discounted intermediate rewards.
Bubeck and Munos assume that these intermediate rewards are bounded in $[0, 1]$,
but it is better to translate them to $[-1, 1]$ if we plan to use the GPTS algorithm (so that
the prior mean for $f$ can indeed be taken to be 0).

5.2.1 Choice of a GP model for rewards in discounted MDPs 

We model our belief on what we expect the intermediate reward functions to be, by considering,
at each node $n_\tau$ in the sequence of actions being explored, a set of random variables
$F_1^{(n_\tau)}, \ldots, F_B^{(n_\tau)}$ such that the intermediate reward function values for all possible actions from
node $n_\tau$ are a realisation of these variables. We assume that each of these random variables follows a
normalised Gaussian distribution, and that they are all independent. We now determine
the tree paths kernel function that follows from this assumption. A path is a list of nodes
$n_0, n_1, \ldots, n_D$, where $n_0$ is the root, corresponding to a list of indices $i_1, \ldots, i_D$ of actions
taken in the environment. Our belief on the function value for this path is represented by
$F_{i_1}^{(n_0)} + \gamma F_{i_2}^{(n_1)} + \ldots + \gamma^{D-1} F_{i_D}^{(n_{D-1})}$. If two paths $x$ and $x'$ have $h$ action indices in common, they
can be represented by $i_1, \ldots, i_h, i_{h+1}, \ldots, i_D$ and $i_1, \ldots, i_h, i'_{h+1}, \ldots, i'_D$. The kernel product between
these two paths is given by:

$$\kappa(x, x') = \mathrm{cov}\Big( F_{i_1}^{(n_0)} + \ldots + \gamma^{h-1} F_{i_h}^{(n_{h-1})} + \gamma^{h} F_{i_{h+1}}^{(n_h)} + \ldots + \gamma^{D-1} F_{i_D}^{(n_{D-1})},$$
$$\qquad\qquad F_{i_1}^{(n_0)} + \ldots + \gamma^{h-1} F_{i_h}^{(n_{h-1})} + \gamma^{h} F_{i'_{h+1}}^{(n_h)} + \ldots + \gamma^{D-1} F_{i'_D}^{(n'_{D-1})} \Big)$$
$$= 1 + \gamma^2 + \ldots + \gamma^{2(h-1)} = \frac{1 - \gamma^{2h}}{1 - \gamma^2}$$

where we used the bilinearity of the covariance, the independence of the random variables,
and the fact that their variances are always 1. This characterises our belief on the discounted
sum of rewards $f$. Note that the kernel is not normalised: $\kappa(x, x) = \frac{1 - \gamma^{2D}}{1 - \gamma^2}$, which grows
with $D$. This reflects the fact that the signal variance is higher for deeper trees.

16. For UCB, the problem-specific bound is in $O(\log(T))$ while the problem-independent bound is in $O(\sqrt{T})$
(see Bubeck, 2010).

Although OLOP and BAST are very similar in spirit, OLOP exploits the fact that $f$
is globally smooth, owing to the discount factor $\gamma$. Again, the GP smoothness assumption
is weaker in the sense that the intermediate rewards are not bounded, but it is stronger
since we make an assumption on how they are distributed. However, previous studies on
Bayesian optimisation give us reasons to think that this may be reasonable in practice.
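The path kernel of Section 5.2.1 is simple to compute from two action sequences. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the "action indices in common" form a shared prefix of the two paths, as in the representation used in the derivation.

```python
def common_prefix_length(x, x_prime):
    """Number h of leading action indices shared by two paths."""
    h = 0
    for a, b in zip(x, x_prime):
        if a != b:
            break
        h += 1
    return h

def discounted_path_kernel(x, x_prime, gamma):
    """kappa(x, x') = sum_{t=0}^{h-1} gamma^(2t), with h the common prefix length."""
    h = common_prefix_length(x, x_prime)
    return sum(gamma ** (2 * t) for t in range(h))

def closed_form(h, gamma):
    """Closed form (1 - gamma^(2h)) / (1 - gamma^2), valid for 0 < gamma < 1."""
    return (1.0 - gamma ** (2 * h)) / (1.0 - gamma ** 2)

# Two depth-3 paths sharing their first two actions.
k = discounted_path_kernel([0, 1, 1], [0, 1, 0], gamma=0.5)
```

Paths that diverge at the root have kernel value 0, and the prior variance of a single path is obtained by passing the same path twice.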

5.2.2 Simple regret 

Bubeck and Munos (2010) consider the simple regret $f^* - f_{best}(T)$ as a more appropriate
measure of performance for a planning algorithm, for which they obtain a bound by dividing
a cumulative regret bound by $T$. Since the algorithm outputs the best observed path,
it might actually be interesting to give a bound on the empirical simple regret, i.e. on
$|f^* - y_{best}(T)|$. A relationship between the empirical simple regret and cumulative regret can
be given with high probability. First, we define the empirical cumulative regret as
$R'_T = T f^* - \sum_{t=1}^{T} y_t$. Coquelin and Munos (2007) give a relationship between the cumulative
regret and the empirical cumulative regret: $|R_T - R'_T| = O(\sqrt{T})$ with high probability.
Finally:

$$f^* - y_{best}(T) \le \frac{1}{T} R'_T \le \frac{1}{T} R_T + O\left(\frac{1}{\sqrt{T}}\right)$$

with high probability.
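The first inequality above holds because the best observed reward is at least the average observed reward. A minimal sketch with made-up reward values (the function names and the numbers are hypothetical):

```python
def empirical_simple_regret(f_star, rewards):
    """f* minus the best observed reward y_best(T)."""
    return f_star - max(rewards)

def empirical_cumulative_regret(f_star, rewards):
    """R'_T = T * f* - sum of observed rewards."""
    return len(rewards) * f_star - sum(rewards)

# Made-up observations for illustration.
rewards = [0.2, 0.5, 0.9, 0.4]
simple = empirical_simple_regret(1.0, rewards)
average = empirical_cumulative_regret(1.0, rewards) / len(rewards)
```

Here `simple` is about 0.1 while `average` is about 0.5, so a bound on the cumulative regret divided by $T$ also bounds the empirical simple regret.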

5.2.3 Regret bounds

The cumulative regret considered by Bubeck and Munos is measured as a function of the
number of calls $n$ to the generative model, which is equal to $D T$ for us. Their immediate
regret for a given policy is defined as the difference between the infinite sum of discounted
rewards for the sequence of nodes chosen by the optimal policy, and for the sequence of
nodes given by following our policy for $D$ actions and switching to the optimal policy from
then on. Consider the $t$-th path exploration. Let us write $n_{D,t}$ the node that we have
after following our policy for $D$ actions. It may be different from the node $n^*_D$ that we
would have had with the optimal policy. For this reason, $n^*_{D+1}$ may not be available after
$n_{D,t}$, which implies that the sequences of nodes that follow can be different, even though
we are using the same, optimal policy. Consequently, the immediate regret considered by
Bubeck and Munos is equal to our regret $r_t$, measured up to depth $D$, plus at most $\gamma^D \sum_{t=1}^{\infty} 2\gamma^{t-1}$,
where the intermediate reward differences after $D$ actions are all bounded by 2 (since rewards
lie in $[-1, 1]$). Thus, stopping the exploration at depth $D$ implies a cost in the order of $\gamma^D$
on the simple regret. Clearly, fixing $D$ implies a simple regret in $O(1)$ and is a poor choice.
It is necessary to go deeper down the tree as the number of iterations $T$ (fixed in advance)
increases. This is why OLOP builds a tree whose depth depends on $T$.

GPTS can also be used in a similar fashion, by fixing $T$ and choosing $D$ as a function
of $T$. This makes $N$ depend on $T$, and $I^*(T)$ will not be bounded by a constant anymore.
We adapt our previous results by determining $\chi_j$ for the current kernel. If two paths differ
on $j$ nodes, they have $D(T) - j$ action indices in common.

$$\chi_j - \chi_{j+1} = \frac{(\gamma^2)^{D(T)-(j+1)} - (\gamma^2)^{D(T)-j}}{1 - \gamma^2} = (\gamma^2)^{D(T)-j-1}$$

i-1 

k = 2>'0o--x,-+i) 

3=0 

2D(T) £1 B . 



This expression is very similar to the one obtained for the linear kernel, but with in place 
of B and 7 ^ ) in place of jj^pj- As a result, L(t) can be taken as 7 rgz^j^~ which implies 
$I^*(T) = O(\gamma^{2D(T)} N(T))$. Taking $D(T) = \log_B(T)$, as OLOP does, implies $N(T) = T$
and $I^*(T) = O(T^{1 - 2\log_B(1/\gamma)})$. The regret becomes:

$$R_T = O\left(\sqrt{T I^*(T) \log(N(T))}\right) + O\left(T \gamma^{D(T)}\right) = \tilde{O}\left(T^{1 - \log_B(1/\gamma)}\right)$$



with high probability. This is similar to the OLOP bound for the case where $\gamma^2 B > 1$, but
with $T$ instead of $n$. We write $a = \log_B(1/\gamma) > 0$. We have that $1 - a > 0$ since $\gamma B > 1$.
Using $n = D T = T \log_B(T)$ we can show that the OLOP bound in $n$ implies the same
bound as ours in $T$.

= 6{n- a ) 

< a\og(n~ a ) l3 n 1 ~ a where (3 is even 

= din 1 -") 

< a\og{n 1 - a ) l3 n 1 - a 

< a (l-a)\og{nfT l ~ a \og B (T) l - a 

< a' \og{Tf +1 - a T 1 - a 

- (i _ a )/?+i-a 1C W > 
= 0(T 1_a ) 

Note that the simple regret bound of GPTS in $\tilde{O}(T^{-\log_B(1/\gamma)})$ is better than that of
OLOP in $\tilde{O}(T^{-1/2})$ when $0 < \gamma < 1/\sqrt{B}$, which is understandable since our assumptions are
stronger.
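The exponents in this section can be verified numerically. The sketch below uses hypothetical function names and example values of $B$, $\gamma$ and $T$; it checks that $\gamma^{2D(T)} N(T)$ equals $T^{1-2a}$ when $D(T) = \log_B(T)$, with $a = \log_B(1/\gamma)$, and that both regret terms then scale as $T^{1-a}$ (ignoring logarithmic factors).

```python
import math

def discount_exponent(B, gamma):
    """a = log_B(1 / gamma)."""
    return math.log(1.0 / gamma, B)

def info_gain_scale(T, B, gamma):
    """gamma^(2 D) * N with D = log_B(T) and N = B^D = T."""
    depth = math.log(T, B)
    return gamma ** (2.0 * depth) * B ** depth

# Example values (assumptions for illustration): B = 2, gamma = 0.8, T = 1024.
B, gamma, T = 2, 0.8, 1024
a = discount_exponent(B, gamma)
scale = info_gain_scale(T, B, gamma)
first_term = math.sqrt(T * scale)          # sqrt(T * I*(T)) without the log factor
second_term = T * gamma ** math.log(T, B)  # T * gamma^D(T)
```

Both `first_term` and `second_term` coincide with `T ** (1 - a)` up to floating-point error for these values.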

5.3 Possible extensions: a few ideas 

The ideas introduced in this work can be further developed and extended to some particular 
tree search problems, as mentioned in this section. 

5.3.1 Variants with different outputs and stopping criteria 

We mentioned in the introduction that we could use the confidence intervals built by GPB 
to change the output of our algorithm: instead of outputting the best observed path, we 
could output the path with highest lower confidence bound for instance. When outputting 
an optimal action to take from the root, in the MDP planning case, we could output the 
action that has highest estimated reward in the long term. These variants might be more 
robust to the variability of the rewards (we could be misled by an unlikely high reward value 
for a mediocre path), but we do not have regret bounds for them. Confidence intervals can 
also be used to determine a stopping criterion, e.g. stop when the width of the confidence 
interval (at a given confidence threshold) for the best observed action is smaller than a 
certain threshold. We then automatically have a performance guarantee for our algorithm, 
but no guarantee on the runtime. 
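These variants can be sketched as follows, assuming we already have posterior means and standard deviations for a set of candidate paths. The confidence parameter `beta`, the function names and the numbers are hypothetical; no regret bound is claimed for this sketch.

```python
import math

def best_path_by_lcb(means, stds, beta):
    """Index of the path with the highest lower confidence bound mean - sqrt(beta) * std."""
    lcbs = [m - math.sqrt(beta) * s for m, s in zip(means, stds)]
    return max(range(len(lcbs)), key=lcbs.__getitem__)

def should_stop(std, beta, max_width):
    """Stop when the confidence interval width 2 * sqrt(beta) * std is small enough."""
    return 2.0 * math.sqrt(beta) * std <= max_width

# Path 0 has the higher mean but is very uncertain; path 1 is well explored.
means, stds = [1.0, 0.9], [0.5, 0.05]
chosen = best_path_by_lcb(means, stds, beta=4.0)
stop = should_stop(stds[chosen], beta=4.0, max_width=0.25)
```

In this example the well-explored path is returned even though its mean is lower, which illustrates the robustness to an unlikely high reward on a mediocre path.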

5.3.2 Hierarchical optimisation 

We can use GPTS in a manner similar to HOO, with B = 2 and D = 0(log(T)), to find the 
maximum of a function in a space for which we are given a tree of coverings. Each leaf node 
of the tree corresponds to a region of the search space (these get smaller as D increases), 
and we aim at learning the average / values in these regions. The search space can be the 
Cartesian product of an arbitrary (and possibly infinite) family of discrete and continuous 
sets. Hierarchical Optimistic Optimisation has the advantage that the choice of a point to 
sample the function at is very straightforward, it doesn't require any heuristics as in GP
Optimisation, and it offers a unified framework for many optimisation problems (not just
in $\mathbb{R}^d$).

5.3.3 Modelling dependencies between nodes 

One of the most interesting lines of research for Upper Confidence-type Tree Search
algorithms is to generalise between nodes of the tree, according to Gelly (2007). Domain
knowledge can be incorporated in GPB by encoding heuristics in the prior mean, but also
by labelling nodes and incorporating a kernel between nodes in our kernel between tree
paths.

Go tree search When searching Go game trees, we could simply label nodes by the 
corresponding Go boards. We would then use a kernel between Go boards, applied to the 
leaves of two paths. Nodes would be selected in a sequence, starting from the root and 
computing the upper confidence values of each possible next Go board in order to select the 
next node. The same instance of the GPB algorithm can be used at each time we need to 
search for an optimal move, and the knowledge gained by the algorithm can be transferred 
from one game to the other. 

Trees with labelled nodes Nodes can be labelled by feature vectors, and a natural 
kernel between sequences of nodes would be a product of Gaussian kernels between the 
feature vectors at same depth: this is the same as creating a feature representation of the 
path by concatenating the feature vectors of its nodes and then taking a Gaussian kernel in 
the paths feature space. Let us write K the kernel matrix between the children of a node 
(assuming this matrix is the same for all non-leaf node of the tree). The method of our 
eigenanalysis of Kb,d could be adapted by first writing Kb,d = K(-Kb,z>-i) where K is the 
B x B block matrix with coefficients taken from K (itself a B x B matrix). Here, K takes 
the role of Jb and K takes the role of J^ B \ However, with regard to the implementation 
of the algorithm, it is not clear how the search for the maximum of the upper confidence 
function would be performed. 
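The equivalence between a product of per-depth Gaussian kernels and a single Gaussian kernel on concatenated features can be checked directly. The sketch assumes the parameterisation $\exp(-\|u - v\|^2 / (2 s^2))$ with the same width $s$ at every depth; the function names and feature values are hypothetical.

```python
import math

def gaussian_kernel(u, v, width):
    """exp(-||u - v||^2 / (2 * width^2)) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * width ** 2))

def path_kernel_product(path_a, path_b, width):
    """Product of Gaussian kernels between node features at the same depth."""
    result = 1.0
    for u, v in zip(path_a, path_b):
        result *= gaussian_kernel(u, v, width)
    return result

def path_kernel_concat(path_a, path_b, width):
    """Gaussian kernel between the concatenated feature vectors of two paths."""
    flat_a = [value for node in path_a for value in node]
    flat_b = [value for node in path_b for value in node]
    return gaussian_kernel(flat_a, flat_b, width)

# Two paths of depth 2 with 2-dimensional node features (made-up values).
path_a = [[0.0, 1.0], [2.0, 0.5]]
path_b = [[1.0, 1.0], [1.5, 0.0]]
k_product = path_kernel_product(path_a, path_b, width=1.5)
k_concat = path_kernel_concat(path_a, path_b, width=1.5)
```

The two values agree because the squared distance between concatenated vectors is the sum of the per-depth squared distances.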

Planning in MDPs When generative models are available, we could aim to learn imme- 
diate reward functions at each node, as functions of the children's labels. There would be 
one GPB instance per node, and after the exploration of a path we would train each instance 
for each node along that path with the corresponding immediate reward that was observed 
(this assumes that we can observe immediate rewards and not only the discounted sum of 
these rewards). The selection of nodes would be performed in a way similar to UCT, by 
using a sequence of UCB-type bandit algorithms. We can get simple regret bounds at each 
node n, for f n being the immediate mean-reward function at this node, that take advantage 
of the spectral properties of K. We can then combine these to get simple regret bounds for 
whole paths and / being the discounted sum of immediate mean-rewards. 

5.3.4 Closed-loop planning in MDPs 

Finally, this work might be extended to closed-loop planning in communicating MDPs with 
deterministic transitions, by considering cycles through the graph of states instead of paths 
through a tree of given depth: all cycles have a length smaller than the diameter D of the
MDP, which is finite in communicating MDPs. Closed-loop planning differs from open-loop
in the fact that the chosen actions depend on the current states and not only on time. If no 
generative model of the MDP is available, we would directly interact with the environment 
and we would therefore use the cumulative regret as a measure of performance, since every 
interaction with the environment would have a cost. If there are dependencies between
actions, GP inference could be used to derive tighter upper confidence bounds to be used
in an algorithm such as UCycle (Ortner, 2010).



5.4 Conclusion and future work 

To sum up, in this paper we have presented a bandit-based Tree Search algorithm which
makes use of the Gaussian Processes framework to model the reward function on leaves.
The resulting assumption on the smoothness of the function is easily configurable through
the use of covariance functions, whose parameters can be learnt by maximum likelihood if
not known in advance. We have analysed the regret of the algorithm and provided
problem-independent bounds with tight constants, expressed in terms of the $\chi_d$, $0 \le d \le D$, parameters
of the covariance function between paths. When comparing to other algorithms, we have
shown in particular that GPTS applied to planning in MDPs achieves the same regret growth
rate as the recent OLOP algorithm.

We believe that the analysis presented in this paper will provide groundwork for studying
the theoretical properties of some of the extensions mentioned previously. It should also
be possible to extend our results to the more agnostic setting where $f$ is a function with
finite norm in a given RKHS, by using another bound derived by Srinivas et al.,
although that bound involves an $I^*(T)$ term instead of $\sqrt{I^*(T)}$, and it is not known yet
whether it is optimal. It may also be of interest to derive bounds for non-noisy observations
of $f$ (for planning in deterministic environments, for instance), and to analyse the number
of times we play sub-optimal arms, in order to get problem-specific bounds (as done by
Audibert et al. (2007)). Finally, to complement the theoretical analysis of the algorithm, we
should investigate the performance of GPB on practical Tree Search problems. In particular,
it would most certainly be interesting to see if the use of a kernel between Go boards could
be beneficial compared to other techniques used in Go AI such as UCT-RAVE.